Job Description
GPU Optimisation Engineer — Real-Time Inference
Want to push GPU performance to its limits — not in theory, but in production systems handling real-time speech and multimodal workloads?
This team is building low-latency AI systems where milliseconds actually matter. The target isn’t “faster than baseline.” It’s sub-50ms time-to-first-token at 100+ concurrent requests on a single H100 — while maintaining model quality.
They’re hiring a GPU Optimisation Engineer who understands GPUs at an architectural level. Someone who knows where performance is really lost: memory hierarchy, kernel launch overhead, occupancy limits, scheduling inefficiencies, KV cache behaviour, attention paths. The work sits close to the metal, inside inference execution — not general infra, not model research.
You’ll operate across the kernel and runtime layers, profiling large-scale speech and multimodal models end-to-end and removing bottlenecks wherever they appear.
What you’ll work on
- Profiling GPU bottlenecks across memory bandwidth, kernel fusion, quantisation, and scheduling
- Writing and tuning custom CUDA / Triton kernels for performance-critical paths (a minimal sketch follows this list)
- Improving attention, decoding, and KV cache efficiency in inference runtimes
- Modifying and extending vLLM-style systems to better suit real-time workloads
- Optimising models to fit GPU memory constraints without degrading output quality
- Benchmarking across NVIDIA GPUs (with exposure to AMD and other accelerators over time)
- Partnering directly with research to turn new model ideas into fast, production-ready inference
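To give a flavour of the kernel-level work (an illustrative sketch only, not this team's code, and all names here are hypothetical): below is a minimal Triton kernel that fuses a scale and an add so the intermediate result never makes an extra round trip through global memory.

```python
# Illustrative only: a minimal fused elementwise kernel in Triton.
# All function and variable names are hypothetical examples.
import torch
import triton
import triton.language as tl


@triton.jit
def fused_scale_add_kernel(x_ptr, y_ptr, out_ptr, scale, n_elements,
                           BLOCK_SIZE: tl.constexpr):
    # Each program instance handles one contiguous block of elements.
    pid = tl.program_id(axis=0)
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    # Fusing the two ops avoids a second kernel launch and an extra global-memory pass.
    tl.store(out_ptr + offsets, x * scale + y, mask=mask)


def fused_scale_add(x: torch.Tensor, y: torch.Tensor, scale: float) -> torch.Tensor:
    # Launch a 1D grid sized to cover every element exactly once.
    out = torch.empty_like(x)
    n_elements = x.numel()
    grid = (triton.cdiv(n_elements, 1024),)
    fused_scale_add_kernel[grid](x, y, out, scale, n_elements, BLOCK_SIZE=1024)
    return out
```

In this role the interesting versions of this kind of work live in attention, decoding, and KV cache paths, where the same fusion and memory-bandwidth reasoning applies at much larger scale.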
This is hands-on optimisation work across the stack. No layers of bureaucracy. No “platform ownership” theatre. Just deep performance engineering applied to models that are actively evolving.
What tends to work well
- Strong experience with CUDA and/or Triton
- Deep understanding of GPU execution (memory hierarchy, scheduling, occupancy, concurrency)
- Experience optimising inference latency and throughput for large generative models
- Familiarity with attention kernels, decoding paths, or LLM-style runtimes
- Comfort profiling with low-level GPU tooling
The company is revenue-generating, its models are used by global enterprises, and the SF R&D team is expanding following a recent raise. This is growth hiring, not backfill.
Package & location
- Base salary: up to ~$300,000 (negotiable based on depth of experience)
- Equity: meaningful stock
- Location: San Francisco preferred (relocation and visa sponsorship can be provided)
If you care about real-time constraints, GPU architecture, and squeezing every last millisecond out of large models, this is worth a conversation.
All applicants will receive a response.