Job Description
Are you looking to scale GPU infrastructure up to and beyond 10,000 GPUs?
You'll help push an already high-performing team past their current operating level, using your skills and experience to scale training workloads, improve cluster reliability and utilization, and build systems that hold up under real pressure.
Your focus will be on distributed training and GPU infrastructure, making large-scale training actually usable for researchers—not just possible.
You'll be working across frontier model training, scientific workloads and robotics environments, so you're dealing with high-throughput systems and real-world constraints, not just controlled experiments.
You'll join a team that owns compute end-to-end—infra, systems, and operations—working closely with researchers to make training at this scale reliable.
They've raised over $500M, have real customers, and are now integrating models directly into robotics environments and beyond.
Key experience
- Experience scaling GPU infrastructure from 2,000 to 10,000+ GPUs
- Experience with Ray, Slurm, or similar
- Experience supporting core model training
The culture is collaborative and hands-on:
- Strong focus on knowledge sharing and upskilling
- Cross-team collaboration with researchers
- 6-week cycles to allow deep focus and meaningful impact
- A team that works hard but also likes to keep it fun
Up to $350k base + bonus + equity DOE
Remote across the US, with hybrid options available in SF
All applicants will receive a response.