New Job Opening: Inference Engineer in San Francisco, CA

Job title:	Inference Engineer
Job type:	Permanent
Employment type:	Full-time
Industry:	Generative AI
Salary type:	Annual
Salary:	negotiable
Location:	San Francisco, CA
Job published:	21/05/2026
Job ID:	35998

Job Description

Machine Learning Engineer, Inference

Want to solve realtime inference problems where milliseconds genuinely matter?

This role is with a fast-growing voice AI company building the realtime speech infrastructure layer behind hundreds of millions of production conversations every month. Their systems power enterprise voice experiences used at massive scale across customer support, ordering, and conversational automation.

This is not another generic AI platform role focused on wrapping APIs or building dashboards.

The work here sits deep in the runtime stack, optimising realtime speech systems under production latency constraints. Think streaming inference, scheduler design, GPU utilisation, concurrency optimisation, dynamic batching, and making state-of-the-art speech models actually behave correctly in realtime environments.

You’ll join a lean engineering team working directly on the inference systems behind low-latency conversational speech models. The challenge is not simply generating outputs; it’s generating speech naturally, reliably, and fast enough for real human interaction.

Your work will include:

Building and optimising realtime TTS streaming infrastructure
Improving scheduler and batching systems for production workloads
Reducing time to first audio (TTFA) and time to first byte (TTFB) while maintaining speech quality and stability
GPU profiling and identifying kernel-level bottlenecks
Optimising TensorRT, Triton, ONNX Runtime, and custom serving systems
Managing KV cache systems, speculative decoding, and streaming inference
Supporting heterogeneous deployment environments across NVIDIA and AMD GPUs
Collaborating closely with model researchers to productionise cutting-edge speech systems

A large part of the role involves solving difficult runtime problems where latency consistency, concurrency, and throughput directly impact user experience. The team’s systems already outperform most publicly available realtime speech systems, but there’s still substantial room to push the infrastructure further.

You’ll likely have strong depth across inference systems, runtime optimisation, distributed serving, or GPU performance engineering. Experience with tools like TensorRT, Triton, vLLM, CUDA Graphs, ONNX Runtime, or custom schedulers would be highly valuable.

The environment suits engineers who naturally investigate bottlenecks, enjoy working close to hardware constraints, and care deeply about performance engineering. If reducing latency by 30 ms feels meaningful, you’ll probably enjoy this team.

The stack includes Rust, C++, Python, CUDA, TensorRT, Triton, Kubernetes, AWS, and custom realtime inference infrastructure.

Compensation is highly competitive and flexible depending on experience, including strong salary, equity, and benefits.

Location: Remote across the US or Europe.

If you’re excited by realtime AI systems problems where optimisation work directly shapes production performance at scale, this would be worth exploring.

All applicants will receive a response.
