Job Description
Are you looking to scale GPU infrastructure up to and beyond 10,000 GPUs?
You'll help push an already high-performing team past their current operating level, using your skills and experience to scale training workloads, improve cluster reliability and utilization, and build systems that hold up under real pressure.
Your focus will be on distributed training and GPU infrastructure, making large-scale training actually usable for researchers—not just possible.
You'll be working across frontier model training, scientific workloads and robotics environments, so you're dealing with high-throughput systems and real-world constraints, not just controlled experiments.
You'll join a team that owns compute end-to-end—infra, systems, and operations—working closely with researchers to make training at this scale reliable.
They've raised over $500M, have real customers, and are now integrating models directly into robotics environments and beyond.
Key experience
- Experience scaling GPU infrastructure from 2,000 to 10,000+ GPUs
- Experience with Ray, Slurm, or similar
- Experience supporting core model training
The culture is collaborative and hands-on:
- Strong focus on knowledge sharing and upskilling
- Cross-team collaboration with researchers
- 6-week cycles to allow deep focus and meaningful impact
- A team that works hard but also likes to keep it fun
Up to $350k base + bonus + equity DOE
Remote across the US, with hybrid options available in SF
All applicants will receive a response.