Job Description
Ready to drive next-generation performance for real-time speech AI?
Join a well-funded speech technology company that's rapidly establishing itself as a performance leader in real-time audio processing. Their platform already serves hundreds of enterprise clients across 100+ languages, consistently outperforming established competitors in both accuracy and latency.
With significant investment secured and a growing client base, they're now focusing on scaling their inference infrastructure to maintain their competitive edge. This creates an exceptional opportunity for an inference specialist to work with substantial GPU resources and make a direct impact on a platform that's already deployed at scale.
The role
As Senior Inference Engineer, you'll bridge cutting-edge research and production-ready speech systems. You'll take ownership of the GPU infrastructure, ensuring exceptional performance for real-time applications. Your work at the intersection of high-performance computing and speech processing will push the boundaries of what's possible with modern hardware.
What you'll do
- Design high-performance model serving solutions for speech recognition
- Develop encoder optimisation strategies including sliding window approaches
- Implement batching strategies to maximise throughput
- Create advanced model efficiency techniques through quantisation and pruning
- Establish robust benchmarking systems to track and reduce latency
- Develop internal tools for streamlined model deployment
- Research and implement emerging acceleration methods
What you'll bring
- Experience optimising complex models (seq2seq, speech/audio, real-time systems)
- Strong Python skills plus expertise in C++, CUDA, or Rust
- Hands-on experience with inference frameworks like Triton, TensorRT, or ONNX Runtime
- Experience with FlashAttention, custom CUDA kernels, or equivalent optimisation techniques
- Deep understanding of GPU architecture and memory optimisation
- Track record implementing model quantisation for real-time production systems
Ideal additions
- Open-source inference project contributions (vLLM, TGI, Triton)
- Experience with ASR or audio processing workloads
- Background in high-performance computing with large GPU clusters
- Knowledge of model serving architectures at scale
Your package
- Competitive salary up to €150K depending on experience
- Remote or hybrid working anywhere in Europe, with periodic in-person team gatherings
- Comprehensive health coverage for you and your family
- Unlimited PTO
If you're passionate about squeezing every millisecond from speech models and want the technical freedom to implement cutting-edge optimisation techniques, this is your opportunity to make a significant impact.