Job description
Senior Research Engineer (Data)
$175,000 - $250,000 + Equity + Benefits + PTO
Palo Alto, CA - On-site
Are you passionate about scaling data systems that fuel state-of-the-art AI? Want to play a mission-critical role in training cutting-edge generative models by designing the data infrastructure they rely on?
This is a rare opportunity to join a top-tier AI startup that continues to push the boundaries of what's possible in multimodal generative AI. You'll be joining a high-performing, research-driven team with significant funding and strong momentum, in a high-impact position at the intersection of research and infrastructure.
I'm working with a well-funded AI startup in Palo Alto that's scaling its Research Engineering division. They're looking for a Senior Research Engineer focused on data systems: someone who understands how critical clean, diverse, and scalable data pipelines are to generative model performance. If you're excited about building high-quality datasets and architecting systems that process billions of tokens, this is your chance to make a huge impact.
In this role, you'll partner closely with researchers to build end-to-end data acquisition and processing pipelines. You'll source novel data types, design filtering and deduplication systems, integrate active learning techniques, and help steer research directions based on model gaps. It's a role that combines engineering, research, and strategy, all at serious scale.
You'll have direct technical impact in a fast-paced, research-driven environment alongside some of the brightest minds in AI, while continuing to develop both your technical skills and your career.
The Role
- Architect and maintain scalable pipelines for sourcing, deduplicating, filtering, and preparing massive datasets for training.
- Partner with research scientists to identify model gaps and improve dataset relevance and diversity.
- Collaborate with annotation ops to enhance dataset quality through smart filtering strategies.
- Integrate self-supervised active learning and other advanced data techniques to scale systems efficiently.
- Contribute directly to the performance of cutting-edge video generation models and other generative systems.
- On-site in Palo Alto, CA
Ideal Candidate
- Experience building large-scale data pipelines in domains like computer vision, NLP, robotics, or autonomous systems.
- Strong Python skills and familiarity with deep learning frameworks such as PyTorch.
- Experience working with large-scale data processing tools (e.g., SQL, Spark).
- Solid understanding of distributed systems and performance-aware data infrastructure.
- Proven track record of delivering robust data solutions in fast-paced, research-heavy environments.
- Bonus: experience in data-centric AI, self-supervised learning, or active learning methods.
