Job Description
Want to own the inference layer behind millions of real-world voice AI interactions every day?
You’ll join a profitable, founder-led enterprise conversational AI company powering billions of interactions annually across 30+ languages. Their systems sit behind major global brands and handle millions of customer conversations daily.
They’re now moving toward end-to-end multimodal and speech-to-speech architectures. You’ll own the inference stack powering both their multimodal speech-text LLM and their text reasoning LLM.
The role goes well beyond tuning configs.
You will:
• Optimise production inference across A10, A100 and H100 GPUs
• Own scheduler design, KV cache allocation and batching logic
• Build serving systems tailored to multimodal audio-text workloads
• Support agentic, multi-step reasoning under real latency constraints
• Profile kernel-level bottlenecks and fix them properly
You’ve modified inference framework internals before, not just used them. You’re comfortable in Python and C++, and you’re happy diving into CUDA graphs, memory bandwidth limits or custom kernels when required.
This platform processes over 2 million interactions per day. Latency, throughput and cost are production realities, not lab metrics.
Package: €150,000 base + bonus + stock options
Location: Remote within Europe
If you want full ownership of inference performance at real enterprise scale, let’s talk.
All applicants will receive a response.