Job Description
Build the ML infrastructure that powers cutting-edge AI across multiple domains
Ready to architect MLOps systems from the ground up for a fast-growing AI team? This greenfield opportunity offers complete autonomy to design and build training pipelines for LLMs, computer vision models, and other deep learning architectures that will power next-generation AI applications.
You'll join a well-funded startup ($20M+ raised, with a new round expected this year) developing production-grade AI solutions across regulated industries including healthcare, aerospace, and manufacturing. Founded by a successful entrepreneur with a previous billion-dollar exit, they're already partnering with Fortune 100 and Fortune 500 clients where standard AI approaches fall short.
This role offers exceptional technical ownership: you'll build their ML infrastructure from current basic tooling to production-scale systems that support their rapidly expanding applied AI team. They have significant GPU resources with substantial budget growth expected. As the team scales to ~20 people within the year, there's high potential for you to lead future MLOps hires.
The challenge is substantial: creating infrastructure that supports training across multiple modalities, from LLMs to computer vision models. You'll work with large compute resources and have complete autonomy to select and implement the tooling that will define how the team operates for years to come. Your initial focus will be establishing robust training and evaluation pipelines, then scaling to enterprise-grade data workflows with versioning, monitoring, and automated deployment systems.
Your focus:
- Build training and evaluation pipelines for LLMs, vision models, and other deep learning architectures
- Design distributed training systems on multi-GPU clusters across model types
- Create scalable data pipelines, versioning systems, and model checkpointing workflows
- Implement model serving infrastructure with tools like vLLM, Triton, and TorchServe
- Establish comprehensive monitoring, experiment tracking, and reproducibility systems
- Support a rapidly growing applied AI team with robust CI/CD workflows for ML systems
You should have:
- 3+ years building MLOps infrastructure or ML systems in production environments
- Hands-on experience with training pipelines for deep learning models (LLMs, CNNs, transformers)
- Strong expertise with AWS and Kubernetes (required)
- Proficiency with Python, PyTorch/TensorFlow, and distributed training libraries
- Experience with model tracking tools like Weights & Biases or MLflow
- Understanding of modern ML architectures across multiple domains
Nice to have:
- Experience with LLM inference tools (vLLM, SGLang, Ray Serve)
- Ray experience for distributed computing
- Knowledge of mixed-precision training, quantization, and model optimization
- Computer vision workflow experience
- Data versioning tools (DVC, LakeFS)
- Early-stage startup experience
You'll receive:
- Competitive base salary: circa $250K (based on experience)
- Significant stock package in a fast-growing company
- Access to substantial GPU budget with expected growth
- Healthcare (medical, dental, vision) and 401k with matching
- 20 vacation days plus flexible working arrangements
You must be based in the SF Bay Area or Miami (relocation is provided to Florida only). At this time we can only consider US citizens or green card holders.
Ready to build the infrastructure that powers the future of production AI? All applicants will receive a response.