Your search has found 8 jobs

Want to work on one of the hardest unsolved problems in voice AI — making it actually sound like a human conversation?

Most voice AI falls apart the moment a conversation gets messy. Someone interrupts, emotions shift, the flow breaks — and the model can't keep up.

A small, ambitious SF startup is tackling exactly these problems, building speech models that handle natural conversation the way humans actually experience it. They have a working prototype and early commercial traction across several high-profile industry verticals.

The role

As a Senior Research Scientist, your focus is post-training — curating data, fine-tuning pre-trained speech models, and building the evaluation infrastructure that validates it all. You'll work on large-scale models with access to significant data resources.

What you'll do

  • Shape the data that goes into post-training — sourcing, cleaning and structuring it for large speech models

  • Run supervised fine-tuning (SFT) on pre-trained speech models

  • Build evaluation workflows — automated and human-in-the-loop

  • Drive measurable improvements in hallucination rates, instruction-following and generalisation

What you'll bring

  • PhD in ML or related field with a strong publications record

  • Hands-on experience training large speech models — ASR, TTS, or speech-to-speech

  • Solid post-training and SFT experience

The founding team includes an engineer from a billion-dollar AI company, where they co-created one of the first generative models in the field, alongside the co-creator of the first generative voice at one of the world's largest tech companies.

Base compensation is $400k–$500k, with generous equity.

Based in San Francisco, onsite. Relocation support is available for those in the US who are willing to make the move.

Location: San Francisco
Job type: Permanent
Emp type: Full-time
Salary type: Annual
Salary: negotiable
Job published: 30/04/2026
Job ID: 34047

Ready to architect the future of human-computer voice interaction?

Join an established conversational AI company as they transition from traditional cascaded speech systems to cutting-edge E2E speech-to-speech technology. You'll lead this transformation, building multimodal systems that will redefine how millions interact with AI.

The opportunity

You'll lead the development of speech technology that directly impacts real users at massive scale. The company processes millions of daily interactions across major enterprise clients, meaning your research will shape real-world conversational experiences.

You'll spearhead the development of full-duplex speech systems, creating truly natural AI conversations that go far beyond current capabilities.

Your impact

  • Design and build next-generation multimodal speech LLM architecture from the ground up
  • Drive breakthroughs in speech-to-speech modeling and full-duplex conversation systems
  • Tackle turn-taking, interruption handling, and simultaneous speech processing
  • Bridge cutting-edge research with enterprise-grade production systems
  • Lead a growing team focused on SOTA speech-to-speech breakthroughs and own the development end-to-end

What you'll bring

  • Deep understanding of SOTA speech models and neural audio processing
  • Experience building speech language models/multimodal systems
  • Strong background in speech AI research and modern speech architectures

This is all underpinned by access to a large corpus of real enterprise conversational data and serious GPU infrastructure.

The company has built everything in-house, giving you complete technical control and the freedom to explore any approach that delivers value.

With their established market position and proven track record, you'll have the resources and real-world testing ground to make transformative impact with your research. 

Location

Remote (must be within an EU timezone).

Location: Remote
Job type: Permanent
Emp type: Full-time
Salary type: Annual
Salary: negotiable
Job published: 30/04/2026
Job ID: 33350

Want to build the speech and audio models that define how the next generation of voice AI actually sounds and listens?

A well-funded AI startup has developed new model architectures that make real-time conversational AI finally viable at scale. While most voice AI still suffers from delays and computational bottlenecks, they've solved the core efficiency problems that have held the field back.

The role

As their Senior Research Scientist, you'll build core speech foundation models that could define the next decade of voice interaction. You'll work on novel architectures that have immediate real-world impact for thousands of customers.

What you'll do

  • Design and implement SOTA speech foundation models

  • Develop efficient algorithms for speech processing and audio understanding

  • Create scalable systems that handle massive audio workloads

  • Build comprehensive evaluation methods to validate model performance

  • Collaborate with engineering teams to transition research into production

What you'll bring

  • Deep expertise in modern speech technologies (TTS, Speech LLMs, Voice Conversion/Cloning, Speech Translation, ASR, Audio Understanding)

  • Strong background in generative modelling for audio and speech

  • Publications at leading conferences

  • Track record of implementing research ideas from concept to production

You'll join a solid research team, including technical founders whose published work has fundamentally shifted how the field thinks about efficient, large-scale foundation models. They're well-funded and generating strong revenue. Comp is on par with top AI labs, with a base of $400k+ DOE plus a generous equity package.

The role is based in San Francisco, hybrid with 4 days a week in the office.

If you're excited about building the foundational models that will power the next generation of voice AI, we'd love to hear from you.

All applicants will receive a response.

Location: San Francisco
Job type: Permanent
Emp type: Full-time
Salary type: Annual
Salary: negotiable
Job published: 30/04/2026
Job ID: 33251

Ready to own the data pipeline powering the voice of the next generation of AI characters?

You'll be joining a well-funded startup building AI character technology, where speech is a core part of the product experience.

Think truly natural conversations, handling interruptions, personality shifts and more!

You'll own the datasets that power their speech systems — from raw, messy audio through to clean, versioned training corpora that directly drive TTS and ASR model performance.

Your focus

  • Own the full data lifecycle — defining specs, auditing and curating large-scale audio and text corpora
  • Build automated quality metrics and dashboards across SNR, VAD, WER, speaker verification and safety, validated against listening tests
  • Train and deploy lightweight classifiers for noise detection, diarisation, language ID, and content moderation

What you'll bring

  • Deep experience working with speech and audio data at scale — 1M+ hours
  • Strong ML engineering skills in Python and PyTorch, including training and fine-tuning models like Whisper or Wav2Vec
  • Practical knowledge of audio processing — torchaudio, librosa, spectrograms, DSP basics
  • A solid understanding of audio quality metrics — MOS, WER, PESQ/STOI, SNR, speaker verification
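For illustration only, two of the metrics above are easy to spot-check in plain Python. The sketch below (function names are our own, not part of the company's stack) computes word error rate via word-level edit distance and SNR in decibels; in production you'd reach for audited implementations rather than this.

```python
import math

def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance over reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # Single-row Levenshtein DP over words.
    d = list(range(len(hyp) + 1))
    for i in range(1, len(ref) + 1):
        prev, d[0] = d[0], i
        for j in range(1, len(hyp) + 1):
            sub = prev if ref[i - 1] == hyp[j - 1] else prev + 1
            prev, d[j] = d[j], min(sub, d[j] + 1, d[j - 1] + 1)
    return d[-1] / max(len(ref), 1)

def snr_db(signal: list, noise: list) -> float:
    """Signal-to-noise ratio in dB from signal and noise sample arrays."""
    p_sig = sum(x * x for x in signal) / len(signal)
    p_noise = sum(x * x for x in noise) / len(noise)
    return 10.0 * math.log10(p_sig / p_noise)
```

A WER of 0.0 is a perfect transcript; 20 dB SNR corresponds to a signal with 100× the noise power.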

Nice to have

  • Experience with Spark/Beam, Airflow, SQL or similar data engineering tools
  • Open-source contributions or publications in speech or audio ML
  • Background in denoising and enhancement, and how it affects downstream model quality

Remote, with a preference for European or overlapping timezones. Competitive compensation and equity.

Location: Remote
Job type: Permanent
Emp type: Full-time
Salary type: Annual
Salary: negotiable
Job published: 27/03/2026
Job ID: 34412

Training builds capability. Post-training decides what it becomes.

This team are rethinking how large multimodal models learn after pre-training — developing post-training and reinforcement learning methods that help models reason, plan, and interact in real time.

Founded by the researchers behind several of the most influential modern AI architectures, this lab are pushing alignment and learning efficiency beyond standard RLHF. They’re scaling preference-based training (RLHF, DPO, hybrid feedback loops) to new model types and creating systems that learn from interaction rather than static data.

You’ll work at the intersection of post-training, RL, and model architecture — designing reward models, scalable evaluation frameworks, and training strategies that make large-scale learning measurable and reliable. It’s applied research with direct impact, supported by serious compute and a tight researcher-to-GPU ratio.

You’ll bring experience in large-scale post-training or reinforcement learning (RLHF, DPO, or SFT pipelines), a solid grasp of LLM or multimodal training systems, and the curiosity to explore new optimisation and alignment methods. A publication record at top venues (NeurIPS, ICLR, ICML, CVPR, ACL) is a plus, but impact matters more than titles.

The team are based in San Francisco, working mostly in person. $1 million+ total compensation: base salary circa $300K–$600K (negotiable) plus stock and bonus — exact package depends on experience.

If you want to work where post-training meets architecture — shaping how foundation models learn, reason, and adapt — this is that opportunity.

All applicants will receive a response.

Location: San Francisco
Job type: Permanent
Emp type: Full-time
Salary type: Annual
Salary: negotiable
Job published: 11/02/2026
Job ID: 34012

GPU Optimisation Engineer — Real-Time Inference

Want to push GPU performance to its limits — not in theory, but in production systems handling real-time speech and multimodal workloads?

This team is building low-latency AI systems where milliseconds actually matter. The target isn’t “faster than baseline.” It’s sub-50ms time-to-first-token at 100+ concurrent requests on a single H100 — while maintaining model quality.

They’re hiring a GPU Optimisation Engineer who understands GPUs at an architectural level. Someone who knows where performance is really lost: memory hierarchy, kernel launch overhead, occupancy limits, scheduling inefficiencies, KV cache behaviour, attention paths. The work sits close to the metal, inside inference execution — not general infra, not model research.

You’ll operate across the kernel and runtime layers, profiling large-scale speech and multimodal models end-to-end and removing bottlenecks wherever they appear.
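To make the memory-hierarchy point concrete: at batch size 1, autoregressive decode is typically memory-bandwidth-bound, since every weight must be streamed from HBM once per generated token. A back-of-envelope sketch (all figures are illustrative assumptions, not measurements of this team's models):

```python
def decode_step_floor_ms(param_bytes: float, hbm_bandwidth_gbs: float) -> float:
    """Lower bound on one autoregressive decode step when every weight
    byte is read from HBM exactly once (the memory-bound regime)."""
    return param_bytes / (hbm_bandwidth_gbs * 1e9) * 1e3

# Assumed figures: 7B parameters in FP16 (2 bytes each) and roughly
# 3,350 GB/s of HBM bandwidth on an H100-class part.
step_ms = decode_step_floor_ms(7e9 * 2, 3350)  # ~4.2 ms per token
```

Quantisation attacks param_bytes directly; kernel, scheduling, and batching work attacks the gap between this floor and measured latency.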

What you’ll work on

  • Profiling GPU bottlenecks across memory bandwidth, kernel fusion, quantisation, and scheduling

  • Writing and tuning custom CUDA / Triton kernels for performance-critical paths

  • Improving attention, decoding, and KV cache efficiency in inference runtimes

  • Modifying and extending vLLM-style systems to better suit real-time workloads

  • Optimising models to fit GPU memory constraints without degrading output quality

  • Benchmarking across NVIDIA GPUs (with exposure to AMD and other accelerators over time)

  • Partnering directly with research to turn new model ideas into fast, production-ready inference

This is hands-on optimisation work across the stack. No layers of bureaucracy. No “platform ownership” theatre. Just deep performance engineering applied to models that are actively evolving.

What tends to work well

  • Strong experience with CUDA and/or Triton

  • Deep understanding of GPU execution (memory hierarchy, scheduling, occupancy, concurrency)

  • Experience optimising inference latency and throughput for large generative models

  • Familiarity with attention kernels, decoding paths, or LLM-style runtimes

  • Comfort profiling with low-level GPU tooling

The company is revenue-generating, its models are used by global enterprises, and the SF R&D team is expanding following a recent raise. This is growth hiring, not backfill.

Package & location

  • Base salary: up to ~$300,000 (negotiable based on depth of experience)

  • Equity: Meaningful stock

  • Location: San Francisco preferred (relocation and visa sponsorship can be provided)

If you care about real-time constraints, GPU architecture, and squeezing every last millisecond out of large models, this is worth a conversation.

All applicants will receive a response.

Location: San Francisco, CA
Job type: Permanent
Emp type: Full-time
Salary type: Annual
Salary: negotiable
Job published: 11/02/2026
Job ID: 34843

Looking to push the boundaries of generative AI for real-time interaction?

You'll be joining a well-funded startup working on multimodal AI where voice, vision, and language come together.

They're building generative models for natural conversational experiences that need to perform in real-time.

There are no resource limitations here: they have plenty of compute for you to run experiments at scale. You'll work alongside a well-known open-source leader, as well as a very strong speech R&D team from leading companies.

Your mission

You'll be building and optimising diffusion or flow-matching models that power their speech and audio generation. This means developing production-ready architectures that can generate controllable, high-quality output at scale.

You'll own the full research-to-production pipeline — from architecture design and training through deployment and optimisation.

Your work will directly impact how millions of AI characters sound and interact.

Your focus

  • Design and train large-scale diffusion or flow-matching models

  • Develop novel architectures and training techniques to improve controllability and quality

  • Build evaluation systems to measure generation quality and model behaviour

  • Work from low-level performance optimisations to high-level model design

What you'll bring

  • Proven track record building diffusion models or flow-matching systems (experience in other modalities counts)

  • Experience training large models (3B+ parameters) with distributed systems

  • Hands-on experience with streaming or distillation of diffusion models

Nice to have

  • Experience with audio or speech generation

  • Publications or open-source contributions in diffusion models or generative AI

Remote in Europe. Base salary is €140K–200K DOE (with some flex for the right person), plus generous stock.

Location: Remote
Job type: Permanent
Emp type: Full-time
Salary type: Annual
Salary: negotiable
Job published: 26/01/2026
Job ID: 34280

Want to build speech AI that actually sounds human?

You'll be joining a well-funded speech AI startup with strong customer traction. They're building ultra-realistic voice technology that handles natural laughter, breathing, seamless language switching, and accurate pronunciation across languages and accents.

As their Senior Speech Scientist, you'll work hands-on to expand their foundation models and push the boundaries of what's possible in speech AI: exploring multilingual capabilities, long-context generation, full-duplex modeling for natural conversations with interruptions, and novel architectures that balance speed with control.

What you'll do

  • Conduct research to advance their core speech models and extend product capabilities
  • Develop and experiment with new model architectures and training approaches
  • Work on large-scale model training and data systems
  • Collaborate with the team to take research from concept to deployed systems

What you'll bring

  • 3+ years of experience in speech synthesis, audio generation, or generative modeling
  • Experience with audio generation using LLMs
  • Solid background in modern language model architectures
  • Proven ability to ship research into production systems
  • Experience training large-scale models

Nice to have

  • Published research in speech or generative modeling
  • Experience with real-time speech systems or multimodal models

Ideally based in SF, though remote worldwide can also be considered. Comp is up to $250K base DOE, plus equity.

Location: San Francisco, CA
Job type: Permanent
Emp type: Full-time
Salary type: Annual
Salary: negotiable
Job published: 23/12/2025
Job ID: 34579