Job title: Staff ML Infrastructure Engineer (GPU & Distributed Systems)
Job type: Permanent
Emp type: Full-time
Industry: Research
Salary type: Annual
Salary: negotiable
Job published: 08/04/2026
Job ID: 35635

Job Description

Are you looking to scale GPU infrastructure up to and beyond 10,000 GPUs?
You'll help push an already high-performing team past its current operating level, using your skills and experience to scale training workloads, improve cluster reliability and utilization, and build systems that hold up under real pressure.
Your focus will be on distributed training and GPU infrastructure, making large-scale training actually usable for researchers—not just possible.
You'll be working across frontier model training, scientific workloads and robotics environments, so you'll be dealing with high-throughput systems and real-world constraints, not just controlled experiments.
You'll join a team that owns compute end-to-end (infra, systems, and operations), working closely with researchers to make training at this scale reliable.
The company has raised over $500M, has real customers, and is now integrating models directly into robotics environments and beyond.
Key experience
  • Experience scaling GPU infrastructure from 2,000 to 10,000+ GPUs
  • Experience with Ray, Slurm or similar
  • Experience supporting core model training

The culture is collaborative and hands-on:
  • Strong focus on knowledge sharing and upskilling
  • Cross-team collaboration with researchers
  • 6-week cycles to allow deep focus and meaningful impact
  • A team that works hard but also likes to keep it fun
Up to $350k base + bonus + equity, DOE
Remote across the US or hybrid options available in SF

All applicants will receive a response. 