Job title: GPU Optimization Engineer
Job type: Permanent
Emp type: Full-time
Industry: Generative AI
Salary type: Annual
Salary: negotiable
Location: San Francisco, CA
Job published: 11/02/2026
Job ID: 34843

Job Description

GPU Optimization Engineer — Real-Time Inference

Want to push GPU performance to its limits — not in theory, but in production systems handling real-time speech and multimodal workloads?

This team is building low-latency AI systems where milliseconds actually matter. The target isn’t “faster than baseline.” It’s sub-50ms time-to-first-token at 100+ concurrent requests on a single H100 — while maintaining model quality.

They’re hiring a GPU Optimization Engineer who understands GPUs at an architectural level. Someone who knows where performance is really lost: memory hierarchy, kernel launch overhead, occupancy limits, scheduling inefficiencies, KV cache behavior, attention paths. The work sits close to the metal, inside inference execution — not general infra, not model research.

You’ll operate across the kernel and runtime layers, profiling large-scale speech and multimodal models end-to-end and removing bottlenecks wherever they appear.

What you’ll work on

  • Profiling GPU bottlenecks across memory bandwidth, kernel fusion, quantization, and scheduling

  • Writing and tuning custom CUDA / Triton kernels for performance-critical paths

  • Improving attention, decoding, and KV cache efficiency in inference runtimes

  • Modifying and extending vLLM-style systems to better suit real-time workloads

  • Optimizing models to fit GPU memory constraints without degrading output quality

  • Benchmarking across NVIDIA GPUs (with exposure to AMD and other accelerators over time)

  • Partnering directly with research to turn new model ideas into fast, production-ready inference

This is hands-on optimization work across the stack. No layers of bureaucracy. No “platform ownership” theatre. Just deep performance engineering applied to models that are actively evolving.

What tends to work well

  • Strong experience with CUDA and/or Triton

  • Deep understanding of GPU execution (memory hierarchy, scheduling, occupancy, concurrency)

  • Experience optimizing inference latency and throughput for large generative models

  • Familiarity with attention kernels, decoding paths, or LLM-style runtimes

  • Comfort profiling with low-level GPU tooling (e.g. Nsight Systems, Nsight Compute)

The company is revenue-generating, its models are used by global enterprises, and the SF R&D team is expanding following a recent raise. This is growth hiring, not backfill.

Package & location

  • Base salary: up to ~$300,000 (negotiable based on depth of experience)

  • Equity: meaningful stock grant

  • Location: San Francisco preferred (relocation and visa sponsorship can be provided)

If you care about real-time constraints, GPU architecture, and squeezing every last millisecond out of large models, this is worth a conversation.

All applicants will receive a response.
