Job Description
Build the ML infrastructure that powers cutting-edge AI across multiple domains
Ready to architect MLOps systems from the ground up for a fast-growing AI team? This greenfield opportunity offers complete autonomy to design and build training pipelines for LLMs, computer vision models, and other deep learning architectures that will power next-generation AI applications.
You'll join a well-funded startup ($20M+ raised, with a new round expected this year) developing production-grade AI solutions across regulated industries including healthcare, aerospace, and manufacturing. Founded by a successful entrepreneur with a previous billion-dollar exit, they're already partnering with Fortune 100 and Fortune 500 clients where standard AI approaches fall short.
This role offers exceptional technical ownership: you'll build their ML infrastructure from current basic tooling to production-scale systems that support their rapidly expanding applied AI team. They have significant GPU resources with substantial budget growth expected. As the team scales to ~20 people within the year, there's high potential for you to lead future MLOps hires.
The challenge is substantial: creating infrastructure that supports training across multiple modalities, from LLMs to computer vision models. You'll work with large compute resources and have complete autonomy to select and implement the tooling that will define how the team operates for years to come. Your initial focus will be establishing robust training and evaluation pipelines, then scaling to enterprise-grade data workflows with versioning, monitoring, and automated deployment systems.
Your focus:
- Build training and evaluation pipelines for LLMs, vision models, and other deep learning architectures
- Design distributed training systems on multi-GPU clusters across model types
- Create scalable data pipelines, versioning systems, and model checkpointing workflows
- Implement model serving infrastructure with tools like vLLM, Triton, and TorchServe
- Establish comprehensive monitoring, experiment tracking, and reproducibility systems
- Support a rapidly growing applied AI team with robust CI/CD workflows for ML systems
You should have:
- 3+ years building MLOps infrastructure or ML systems in production environments
- Hands-on experience with training pipelines for deep learning models (LLMs, CNNs, transformers)
- Strong expertise with AWS and Kubernetes (required)
- Proficiency with Python, PyTorch/TensorFlow, and distributed training libraries
- Experience with model tracking tools like Weights & Biases or MLflow
- Understanding of modern ML architectures across multiple domains
Nice to have:
- Experience with LLM inference tools (vLLM, SGLang, Ray Serve)
- Ray experience for distributed computing
- Knowledge of mixed-precision training, quantization, and model optimization
- Computer vision workflow experience
- Data versioning tools (DVC, LakeFS)
- Early-stage startup experience
You'll receive:
- Competitive base salary: circa $250K (based on experience)
- Significant stock package in a fast-growing company
- Access to substantial GPU budget with expected growth
- Healthcare (medical, dental, vision) and 401k with matching
- 20 vacation days plus flexible working arrangements
You must be based in the SF Bay Area or Miami (relocation is provided to Florida only). At this time we can only consider US citizens or green card holders.
Ready to build the infrastructure that powers the future of production AI? All applicants will receive a response.