Job title: Senior Inference Engineer
Job type: Permanent
Emp type: Full-time
Industry: Conversational AI
Skills: Inference, Model Serving, vLLM, GPU optimisation
Salary type: Annual
Salary: negotiable
Location: Remote
Job published: 03/03/2026
Job ID: 35239

Job Description

Want to own the inference layer behind millions of real-world voice AI interactions every day?

You’ll join a profitable, founder-led enterprise conversational AI company powering billions of interactions annually across 30+ languages. Their systems sit behind major global brands and handle millions of customer conversations daily.

They’re now moving toward end-to-end multimodal and speech-to-speech architectures. You’ll own the inference stack powering both their multimodal speech-text LLM and their text reasoning LLM.

This goes well beyond tuning configs.

You will:

• Optimise production inference across A10, A100 and H100 GPUs
• Own scheduler design, KV cache allocation and batching logic
• Build serving systems tailored to multimodal audio-text workloads
• Support agentic, multi-step reasoning under real latency constraints
• Profile kernel-level bottlenecks and fix them properly
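To give a flavour of the sizing arithmetic behind KV cache allocation and batching decisions, here is a minimal sketch. All model dimensions below are illustrative assumptions, not details of this company's stack:

```python
def kv_cache_bytes(num_layers: int, num_kv_heads: int, head_dim: int,
                   seq_len: int, dtype_bytes: int = 2) -> int:
    """Rough per-sequence KV-cache footprint for a decoder-only model.

    K and V each store seq_len * num_kv_heads * head_dim values per layer,
    hence the leading factor of 2.
    """
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * dtype_bytes


# Hypothetical example: 32 layers, 8 KV heads of dim 128, 4k-token context, fp16.
per_seq = kv_cache_bytes(num_layers=32, num_kv_heads=8, head_dim=128, seq_len=4096)
print(f"{per_seq / 2**20:.0f} MiB per sequence")  # 512 MiB
```

At that footprint, an 80 GB H100 holds on the order of a hundred full-length sequences of cache alongside the weights, which is exactly why batch size, context length and scheduler policy trade off against each other in production serving.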

You’ve modified inference-framework internals before, not just used them. You’re comfortable in Python and C++, and you’re happy diving into CUDA graphs, memory-bandwidth limits or custom kernels when required.

This platform processes over 2 million interactions per day. Latency, throughput and cost are production realities, not lab metrics.

Package: €150,000 base + bonus + stock options
Location: Remote within Europe

If you want full ownership of inference performance at real enterprise scale, let’s talk.

All applicants will receive a response.