Job Description
Machine Learning Engineer, Inference
Want to solve realtime inference problems where milliseconds genuinely matter?
This role is with a fast-growing voice AI company building the realtime speech infrastructure layer behind hundreds of millions of production conversations every month. Their systems power enterprise voice experiences used at massive scale across customer support, ordering, and conversational automation.
This is not another generic AI platform role focused on wrapping APIs or building dashboards.
The work here sits deep in the runtime stack, optimising realtime speech systems under production latency constraints. Think streaming inference, scheduler design, GPU utilisation, concurrency optimisation, dynamic batching, and making state-of-the-art speech models actually behave correctly in realtime environments.
You’ll join a lean engineering team working directly on the inference systems behind low-latency conversational speech models. The challenge is not simply generating outputs; it’s generating speech naturally, reliably, and fast enough for real human interaction.
Your work will include:
- Building and optimising realtime TTS streaming infrastructure
- Improving scheduler and batching systems for production workloads
- Reducing time to first audio (TTFA) and time to first byte (TTFB) while maintaining speech quality and stability
- Profiling GPU workloads and identifying kernel-level bottlenecks
- Optimising TensorRT, Triton, ONNX Runtime, and custom serving systems
- Managing KV cache systems, speculative decoding, and streaming inference
- Supporting heterogeneous deployment environments across NVIDIA and AMD GPUs
- Collaborating closely with model researchers to productionise cutting-edge speech systems
A large part of the role involves solving difficult runtime problems where latency consistency, concurrency, and throughput directly impact user experience. The team already operates beyond the performance of most publicly available realtime speech systems, but there’s still substantial room to push the infrastructure further.
You’ll likely have strong depth across inference systems, runtime optimisation, distributed serving, or GPU performance engineering. Experience with tools like TensorRT, Triton, vLLM, CUDA Graphs, ONNX Runtime, or custom schedulers would be highly valuable.
The environment suits engineers who naturally investigate bottlenecks, enjoy working close to hardware constraints, and care deeply about performance engineering. If reducing latency by 30 ms feels meaningful, you’ll probably enjoy this team.
The stack includes Rust, C++, Python, CUDA, TensorRT, Triton, Kubernetes, AWS, and custom realtime inference infrastructure.
Compensation is highly competitive and flexible depending on experience, including strong salary, equity, and benefits.
Location: Remote across the US or Europe.
If you’re excited by realtime AI systems problems where optimisation work directly shapes production performance at scale, this would be worth exploring.
All applicants will receive a response.