Job Description
Want to own the data infrastructure behind some of the most naturalistic voice models in production?
You'll be joining a well-funded speech AI startup that just closed their Series A, with strong enterprise traction and revenue that more than doubled last quarter. They're building ultra-realistic voice technology that handles natural laughter, breathing, seamless language switching, and accurate pronunciation across languages and accents. Their models power hundreds of millions of conversations monthly.
Before training a single model, they built their own corpus: full-duplex, studio-quality conversational speech annotated by PhD linguists. As their Machine Learning Engineer, you'll own the pipelines that turn that raw material into clean, training-ready data.
What you'll do
- Own end-to-end data pipelines from raw audio ingestion through to versioned, training-ready datasets
- Build quality systems that catch annotation errors and alignment issues before they reach a training run
- Maintain the training infrastructure that keeps GPUs fed: dataloaders, streaming datasets, and multi-modal batching
- Build and iterate on tooling across speech representations, including neural codecs, semantic tokens, and mel features
- Handle full- and half-duplex pipeline work, including two-channel alignment and overlap handling
What you'll bring
- Strong engineering fundamentals with experience building ML data pipelines at scale
- Hands-on experience with speech or audio data
- Solid understanding of speech representations and the tradeoffs between them
- Experience with multi-channel audio data, including diarisation and alignment
Nice to have
- Experience with multilingual data pipelines
- Large-scale training infrastructure experience (FSDP, DeepSpeed, Ray)
- Annotation tooling and human-in-the-loop systems
Remote-friendly. Competitive base plus stock.