Job title: Senior Inference Engineer
Job type: Permanent
Emp type: Full-time
Industry: Speech Technology
Salary type: Annual
Salary: negotiable
Location: Remote
Job published: 05/05/2025
Job ID: 33185

Job Description

Ready to drive next-generation performance for real-time speech AI?

Join a well-funded speech technology company that's rapidly establishing itself as a performance leader in real-time audio processing. Their platform already serves hundreds of enterprise clients across 100+ languages, consistently outperforming established competitors in both accuracy and latency.

With significant investment secured and a growing client base, they're now focusing on scaling their inference infrastructure to maintain their competitive edge. This creates an exceptional opportunity for an inference specialist to work with substantial GPU resources and make a direct impact on a platform that's already deployed at scale.

The role

As Senior Inference Engineer, you'll bridge cutting-edge research and production-ready speech systems. You'll take ownership of the GPU infrastructure, ensuring exceptional performance for real-time applications. Your work at the intersection of high-performance computing and speech processing will push the boundaries of what's possible with modern hardware.


What you'll do

  • Design high-performance model serving solutions for speech recognition
  • Develop encoder optimisation strategies including sliding window approaches
  • Implement batching strategies to maximise throughput
  • Apply advanced model efficiency techniques such as quantisation and pruning
  • Establish robust benchmarking systems to track and improve latency
  • Develop internal tools for streamlined model deployment
  • Research and implement emerging acceleration methods

What you'll bring

  • Experience optimising complex models (seq2seq, speech/audio, real-time systems)
  • Strong Python skills plus expertise in C++, CUDA, or Rust
  • Hands-on experience with inference frameworks like Triton, TensorRT, or ONNX Runtime
  • Experience with flash attention, CUDA, or equivalent optimisation techniques
  • Deep understanding of GPU architecture and memory optimisation
  • Track record implementing model quantisation for real-time production systems

Ideal additions

  • Open-source inference project contributions (vLLM, TGI, Triton)
  • Experience with ASR or audio processing workloads
  • Background in high-performance computing with large GPU clusters
  • Knowledge of model serving architectures at scale

Your package

  • Competitive salary up to €150K depending on experience
  • Remote or hybrid working within Europe, with periodic team gatherings across Europe
  • Comprehensive health coverage for you and your family
  • Unlimited PTO 

If you're passionate about squeezing every millisecond from speech models and want the technical freedom to implement cutting-edge optimisation techniques, this is your opportunity to make a significant impact.