Want to work on one of the hardest unsolved problems in voice AI — making it actually sound like a human conversation?

Most voice AI falls apart the moment a conversation gets messy. Someone interrupts, emotions shift, the flow breaks — and the model can't keep up.

A small, ambitious SF startup is tackling exactly these problems, building speech models that handle natural conversation the way humans actually experience it. They have a working prototype and early commercial traction across several high-profile industry verticals.

The role

As a Senior Research Scientist, your focus is post-training — curating data, fine-tuning pre-trained speech models, and building the evaluation infrastructure that validates it all. You'll work on large-scale models with access to significant data resources.

What you'll do

  • Shape the data that goes into post-training: sourcing, cleaning and structuring it for large speech models
  • Run supervised fine-tuning of pre-trained speech models
  • Build evaluation workflows, both automated and human-in-the-loop
  • Drive measurable improvements in hallucination rates, instruction-following and generalisation

What you'll bring

  • PhD in ML or a related field with a strong publication record
  • Hands-on experience training large speech models (ASR, TTS, or speech-to-speech)
  • Solid post-training and SFT experience

The founding team includes a founding engineer from a billion-dollar AI company where they co-created one of the first generative models in the field, alongside the co-creator of the first generative voice at one of the world's largest tech companies.

Base compensation is $400k to $500k, with generous equity.

Based in San Francisco, onsite. Relocation support is available for candidates already in the US who are willing to make the move.

Location: San Francisco
Job type: Permanent
Emp type: Full-time
Salary type: Annual
Salary: negotiable
Job published: 30/04/2026
Job ID: 34047

Ready to own the data pipeline powering the voice of the next generation of AI characters?

You'll be joining a well-funded startup building AI character technology, where speech is a core part of the product experience.

Think highly natural conversations that handle interruptions, personality shifts and more.

You'll own the datasets that power their speech systems — from raw, messy audio through to clean, versioned training corpora that directly drive TTS and ASR model performance.

Your focus

  • Own the full data lifecycle — defining specs, auditing and curating large-scale audio and text corpora
  • Build automated quality metrics and dashboards across SNR, VAD, WER, speaker verification and safety, validated against listening tests
  • Train and deploy lightweight classifiers for noise detection, diarisation, language ID, and content moderation

What you'll bring

  • Deep experience working with speech and audio data at scale — 1M+ hours
  • Strong ML engineering skills in Python and PyTorch, including training and fine-tuning models like Whisper or Wav2Vec
  • Practical knowledge of audio processing — torchaudio, librosa, spectrograms, DSP basics
  • A solid understanding of audio quality metrics — MOS, WER, PESQ/STOI, SNR, speaker verification

Nice to have

  • Experience with Spark/Beam, Airflow, SQL or similar data engineering tools
  • Open-source contributions or publications in speech or audio ML
  • Background in denoising and enhancement, and how it affects downstream model quality

Remote, with a preference for candidates in European or overlapping time zones. Competitive compensation and equity.

Location: Remote
Job type: Permanent
Emp type: Full-time
Salary type: Annual
Salary: negotiable
Job published: 27/03/2026
Job ID: 34412