
Want to own the data infrastructure behind some of the most naturalistic voice models in production?

You'll be joining a well-funded speech AI startup — just closed their Series A — with strong enterprise traction and revenue that more than doubled last quarter. They're building ultra-realistic voice technology that handles natural laughter, breathing, seamless language switching, and accurate pronunciation across languages and accents. Their models are powering hundreds of millions of conversations monthly.

Before training a single model, they built their own corpus: full-duplex, studio-quality conversational speech annotated by PhD linguists. As their machine learning engineer (MLE), you'll own the pipelines that turn that raw material into clean, training-ready data.

What you'll do

  • Own end-to-end data pipelines from raw audio ingestion through to versioned, training-ready datasets
  • Build quality systems that catch annotation errors and alignment issues before they reach a training run
  • Maintain the training infrastructure that keeps GPUs fed — dataloaders, streaming datasets, multi-modal batching
  • Build and iterate on tooling across speech representations including neural codecs, semantic tokens and mel features
  • Handle full- and half-duplex pipeline work including two-channel alignment and overlap handling

What you'll bring

  • Strong engineering fundamentals with experience building ML data pipelines at scale
  • Hands-on experience with speech or audio data
  • Solid understanding of speech representations and the tradeoffs between them
  • Experience with multi-channel audio data including diarisation and alignment

Nice to have

  • Experience with multilingual data pipelines
  • Large-scale training infrastructure experience — FSDP, DeepSpeed, Ray
  • Annotation tooling and human-in-the-loop systems

Remote-friendly. Competitive base plus stock.

Location: Remote, worldwide
Job type: Permanent
Emp type: Full-time
Salary type: Annual
Salary: negotiable
Job published: 04/05/2026
Job ID: 35965

Ready to architect the future of human-computer voice interaction?

Join an established conversational AI company as they transition from traditional cascaded speech systems to cutting-edge end-to-end (E2E) speech-to-speech technology. You'll lead this transformation, building multimodal systems that will redefine how millions interact with AI.

The opportunity

You'll be leading the development of speech technology that directly impacts real users at massive scale. The company processes millions of daily interactions across major enterprise clients, meaning your research will shape real-world conversational experiences.

You'll spearhead the development of full-duplex speech systems, creating truly natural AI conversations that go far beyond current capabilities.

Your impact

  • Design and build next-generation multimodal speech LLM architecture from the ground up
  • Drive breakthroughs in speech-to-speech modeling and full-duplex conversation systems
  • Tackle turn-taking, interruption handling, and simultaneous speech processing
  • Bridge cutting-edge research with enterprise-grade production systems
  • Lead a growing team focused on SOTA speech-to-speech breakthroughs and own the development end-to-end

What you'll bring

  • Deep understanding of SOTA speech models and neural audio processing
  • Experience building speech language models or multimodal systems
  • Strong background in speech AI research and modern speech architectures

This is all underpinned by access to a large corpus of real enterprise conversational data and serious GPU infrastructure.

The company has built everything in-house, giving you complete technical control and the freedom to explore any approach that delivers value.

With their established market position and proven track record, you'll have the resources and real-world testing ground to make a transformative impact with your research.

Location

Remote (must be within an EU time zone).

Location: Remote
Job type: Permanent
Emp type: Full-time
Salary type: Annual
Salary: negotiable
Job published: 30/04/2026
Job ID: 33350

Want to build speech AI that actually sounds human?

You'll be joining a well-funded speech AI startup with strong customer traction. They're building ultra-realistic voice technology that handles natural laughter, breathing, seamless language switching, and accurate pronunciation across languages and accents.

As their Senior Research Scientist, you'll work hands-on to expand their foundation models and push the boundaries of what's possible in speech AI: exploring multilingual capabilities, long-context generation, full-duplex modeling for natural conversations with interruptions, and novel architectures that balance speed with control.

What you'll do

  • Conduct research to advance their core speech models and extend product capabilities
  • Develop and experiment with new model architectures and training approaches
  • Work on large-scale model training and data systems
  • Collaborate with the team to take research from concept to deployed systems

What you'll bring

  • 3+ years of experience in speech synthesis, audio generation, or generative modeling
  • Experience with audio generation using LLMs
  • Solid background in modern language model architectures
  • Proven ability to ship research into production systems
  • Experience training large-scale models

Nice to have

  • Published research in speech or generative modeling
  • Experience with real-time speech systems or multimodal models

Ideally based in San Francisco, but remote worldwide can also be considered. Compensation is up to $250K base, depending on experience (DOE), plus equity.

Location: San Francisco, CA
Job type: Permanent
Emp type: Full-time
Salary type: Annual
Salary: negotiable
Job published: 23/12/2025
Job ID: 34579