Job Description
Looking to push the boundaries of generative AI for real-time interaction?
You'll be joining a well-funded startup working on multimodal AI, where voice, vision, and language come together.
They're building generative models for natural conversational experiences that need to perform in real time.
Resources are not a constraint here: they have plenty of compute for you to run experiments at scale. You'll be working alongside a well-known open-source leader and a very strong speech R&D team drawn from leading companies.
Your mission
You'll be building and optimising diffusion or flow-matching models that power their speech and audio generation. This means developing production-ready architectures that can generate controllable, high-quality output at scale.
You'll own the full research-to-production pipeline - from architecture design and training through deployment and optimisation.
Your work will directly impact how millions of AI characters sound and interact.
Your focus
- Design and train large-scale diffusion or flow-matching models
- Develop novel architectures and training techniques to improve controllability and quality
- Build evaluation systems to measure generation quality and model behaviour
- Work from low-level performance optimisations to high-level model design
What you'll bring
- Proven track record building diffusion or flow-matching models (experience in other modalities counts)
- Experience training large models (3B+ parameters) with distributed systems
- Hands-on experience with streaming or distillation of diffusion models
Nice to have
- Experience with audio or speech generation
- Publications or open-source contributions in diffusion models or generative AI
Remote in Europe. Base salary is €140-200K DOE (with some flex for the right person), plus generous stock.