Job title: GPU Optimization Engineer
Job type: Permanent
Emp type: Full-time
Industry: Generative AI
Salary type: Annual
Salary: negotiable
Location: San Francisco, CA
Job published: 11/02/2026
Job ID: 34843

Job Description

GPU Optimization Engineer — Real-Time Inference

Want to push GPU performance to its limits — not in theory, but in production systems handling real-time speech and multimodal workloads?

This team is building low-latency AI systems where milliseconds actually matter. The target isn’t “faster than baseline.” It’s sub-50ms time-to-first-token at 100+ concurrent requests on a single H100 — while maintaining model quality.

They’re hiring a GPU Optimization Engineer who understands GPUs at an architectural level. Someone who knows where performance is really lost: memory hierarchy, kernel launch overhead, occupancy limits, scheduling inefficiencies, KV cache behavior, attention paths. The work sits close to the metal, inside inference execution — not general infra, not model research.

You’ll operate across the kernel and runtime layers, profiling large-scale speech and multimodal models end-to-end and removing bottlenecks wherever they appear.

What you’ll work on

  • Profiling GPU bottlenecks across memory bandwidth, kernel fusion, quantization, and scheduling

  • Writing and tuning custom CUDA / Triton kernels for performance-critical paths

  • Improving attention, decoding, and KV cache efficiency in inference runtimes

  • Modifying and extending vLLM-style systems to better suit real-time workloads

  • Optimizing models to fit GPU memory constraints without degrading output quality

  • Benchmarking across NVIDIA GPUs (with exposure to AMD and other accelerators over time)

  • Partnering directly with research to turn new model ideas into fast, production-ready inference

This is hands-on optimization work across the stack. No layers of bureaucracy. No “platform ownership” theatre. Just deep performance engineering applied to models that are actively evolving.

What tends to work well

  • Strong experience with CUDA and/or Triton

  • Deep understanding of GPU execution (memory hierarchy, scheduling, occupancy, concurrency)

  • Experience optimizing inference latency and throughput for large generative models

  • Familiarity with attention kernels, decoding paths, or LLM-style runtimes

  • Comfort profiling with low-level GPU tooling (e.g. Nsight Systems, Nsight Compute)

The company is revenue-generating, its models are used by global enterprises, and the SF R&D team is expanding following a recent raise. This is growth hiring, not backfill.

Package & location

  • Base salary: up to ~$300,000 (negotiable based on depth of experience)

  • Equity: meaningful stock grant

  • Location: San Francisco preferred (relocation and visa sponsorship can be provided)

If you care about real-time constraints, GPU architecture, and squeezing every last millisecond out of large models, this is worth a conversation.

All applicants will receive a response.
