Job title: Applied Research Engineer - Synthetic Data
Job type: Permanent
Employment type: Full-time
Industry: Artificial Intelligence & Machine Learning
Salary type: Annual
Salary: negotiable
Location: NYC or London
Job published: 15/04/2025
Job ID: 33152

Job Description

Shape the future of agentic AI through cutting-edge data strategy

Want to pioneer next-generation data techniques for advanced AI systems? This role combines frontier model research with practical implementation at one of Europe's most ambitious AI startups.

You'll join a rapidly growing AI Data team developing cutting-edge data-centric approaches that enhance LLMs, VLMs, and Action Models. This isn't just about collecting data – it's about transforming how AI systems learn and operate through synthetic generation, model distillation, and preference alignment.

Founded with a clear mission to push the boundaries of superintelligent agentic AI, this well-funded startup ($200M raised) is assembling world-class talent focused on both advancing capabilities and ensuring responsible development. Their approach is comprehensive – building proprietary technology from data to models, focusing on language, multimodal, and vision systems with superior performance and cost-effectiveness.

As an Applied Research Engineer focusing on synthetic data, you'll develop sophisticated data strategies that directly impact frontier AI systems:

  • Generate and augment synthetic multimodal datasets for VQA, agent behaviours, and virtual navigation
  • Apply model distillation techniques to optimise large-scale models for edge deployment
  • Design evaluation frameworks to measure improvements across multiple domains
  • Lead research into aligning data with human and AI preferences
  • Collaborate with cross-functional teams to integrate data-driven solutions

This role offers rare access to significant compute resources, with a massive GPU cluster that enables cutting-edge work. You'll be joining at a pivotal stage where your contributions will shape core technology and direction.

Requirements:

  • Strong Python programming skills covering parallel computing, system design, and large-scale deployments
  • Experience developing multimodal data pipelines
  • Background in training and deploying LLMs, VLMs or PyTorch models
  • MSc or PhD in machine learning, computer vision, NLP, or related field
  • Deep understanding of training and evaluation paradigms for multimodal models
  • Ability to work effectively in fast-changing environments

Nice to have:

  • Experience with agent-specific data pipelines
  • Background in multimodal human annotation platforms
  • Document understanding/OCR expertise
  • Synthetic data generation experience (particularly multimodal)

You'll have flexibility to work from New York, London, or remotely within European or US East Coast time zones. For those based in cities with offices, hybrid arrangements are available.

Your package includes a highly competitive salary ($200,000-$350,000 depending on experience) plus significant equity with strong upside potential.

If you're passionate about advancing AI through innovative data approaches and want to make a lasting impact on agentic systems, we'd love to hear from you. All applicants will receive a response.
