Job Description
Training builds capability. Post-training decides what it becomes.
This team is rethinking how large multimodal models learn after pre-training — developing post-training and reinforcement learning methods that help models reason, plan, and interact in real time.
Founded by the researchers behind several of the most influential modern AI architectures, this lab is pushing alignment and learning efficiency beyond standard RLHF. They’re scaling preference-based training (RLHF, DPO, hybrid feedback loops) to new model types and building systems that learn from interaction rather than static data.
You’ll work at the intersection of post-training, RL, and model architecture — designing reward models, scalable evaluation frameworks, and training strategies that make large-scale learning measurable and reliable. It’s applied research with direct impact, supported by serious compute and a tight researcher-to-GPU ratio.
You’ll bring experience in large-scale post-training or reinforcement learning (RLHF, DPO, or SFT pipelines), a solid grasp of LLM or multimodal training systems, and the curiosity to explore new optimisation and alignment methods. A publication record at top venues (NeurIPS, ICLR, ICML, CVPR, ACL) is a plus, but impact matters more than titles.
The team is based in San Francisco, working mostly in person. $1 million+ total compensation. Base salary in the $300K–$600K range (negotiable) plus stock and bonus — the exact package depends on experience.
If you want to work where post-training meets architecture — shaping how foundation models learn, reason, and adapt — this is that opportunity.
All applicants will receive a response.