Job Description
Senior Applied Researcher
Want to build vision-language models (VLMs) that understand complex, real-world environments?
You’ll join a small, highly technical team working on foundational problems in multimodal AI, focused on training models that can interpret, reason about, and act on large-scale first-person video data.
You’ll work directly with the Chief Science Officer, shaping how models are designed, trained, and evaluated. The work sits at the intersection of VLMs, long-context reasoning, and real-world deployment.
The focus is on building systems that move beyond static perception, towards temporal understanding, activity recognition, and higher-level reasoning across dynamic environments.
Your work will centre on:
- Designing and training VLMs on large-scale video datasets
- Developing post-training approaches including supervised fine-tuning (SFT), RLHF, and parameter-efficient tuning
- Building scalable training and evaluation pipelines
- Exploring long-context and temporal modelling
- Designing efficient inference systems for both edge and server-side deployment
- Defining benchmarks for spatial and behavioural understanding
You’ll bring strong experience training deep learning models, ideally transformer-based, along with hands-on work in vision, language, or multimodal systems.
Experience with large datasets, model optimisation, or deploying models into production environments will be valuable. Exposure to video data or long-context modelling is particularly relevant.
This is a team that values speed, ownership, and first-principles thinking. You’ll be working on open-ended problems with real-world impact, with the freedom to explore and define approaches.
Compensation: Highly competitive salary + equity
Location: San Francisco, onsite
If you’re interested in building multimodal systems that operate in real-world settings, and want to join a well-funded, highly skilled research team, please apply now!
All applicants will receive a response.