Job title: Applied Scientist, VLM / Vision Language
Job type: Permanent
Emp type: Full-time
Industry: Artificial Intelligence & Machine Learning
Skills: Diffusion Transformers, Computer Vision, 3D, Python, PyTorch
Salary type: Annual
Salary: negotiable
Location: United States
Job published: 11/02/2026
Job ID: 33847

Job Description

Applied Scientist – Vision Language Models (Multimodal Reasoning)

Ready to build VLMs that go beyond captioning and simple grounding?

This role is centred on advancing vision-language models that power intelligent agents operating in complex, real-world environments. The focus is firmly on multimodal model design, training, and post-training, combined with hands-on computer vision work.

As an Applied Scientist, you’ll work on large multimodal models that integrate visual inputs with language-based reasoning. You’ll explore how VLMs can move from recognition and description toward structured understanding, task execution, and agentic decision-making.

Your work will include designing model architectures, improving cross-modal alignment, and developing post-training strategies that strengthen reasoning, factual consistency, and controllability. You’ll contribute across the full lifecycle, from data curation and supervised fine-tuning through to preference optimisation and evaluation.

This is a research-heavy role with clear production impact. You’ll prototype new ideas, run rigorous experiments, and collaborate with engineering teams to deploy models into live agent workflows.

Your focus will include:

  • Training and fine-tuning large-scale vision-language models
  • Improving multimodal alignment between image and text representations
  • Applying post-training techniques such as SFT, RLHF, DPO, and reward modelling
  • Designing evaluation frameworks for reasoning quality, grounding accuracy, and robustness
  • Working with large multimodal datasets, including synthetic and proprietary data

Hands-on work with VLMs or multimodal foundation models is essential. Experience in post-training, alignment, or preference learning is highly valued.

A solid understanding of how to evaluate multimodal systems, including hallucination, grounding failures, and reasoning gaps, is important. You should be comfortable reading and implementing recent research, and designing experiments that move models forward in measurable ways.

You’ll have ownership over modelling decisions and the opportunity to influence how multimodal intelligence is shaped within a fast-growing AI team.

Compensation: $200,000 - $320,000 base (negotiable depending on level) + bonus + meaningful equity + benefits

Location: SF Bay Area (Hybrid). Remote flexibility in the short term.

If you’re motivated by pushing vision-language models toward deeper reasoning and real-world capability, we’d like to speak with you!

All applicants will receive a response.
