Job description
Senior Machine Learning Engineer (Distributed Systems)
$175,000–$250,000 + Equity + Benefits + PTO
Palo Alto, CA (on-site)
Are you a world-class engineer with a passion for scaling high-performance systems? Do you want to work at the cutting edge of generative AI infrastructure and help build the backbone of tomorrow's foundation models?
This is an incredible opportunity to work on state-of-the-art multimodal machine learning while benefiting from strong equity (significantly above market average) and excellent internal progression.
I'm working with a very well-funded AI startup with strong revenue that is expanding its top-tier Research Engineering team. They're focused on rethinking how multimodal foundation models are trained, pushing the limits of distributed computing, GPU efficiency, and end-to-end optimization.
As a Distributed Systems Engineer, you'll collaborate with research scientists to develop and scale core infrastructure that trains next-gen models on multi-thousand GPU clusters. You'll tackle real-world performance bottlenecks, design resilient distributed systems, and optimize everything from custom CUDA kernels to model inference pipelines.
This is a rare opportunity to have direct technical impact in a fast-paced, research-driven environment alongside some of the brightest minds in AI, while continuing to grow both your technical skills and your career.
The Role
- Architect and scale infrastructure for training large-scale models across massive GPU clusters
- Optimize training performance and hardware utilization end-to-end (Python, PyTorch, CUDA, Triton)
- Build systems for efficient workload distribution, fault tolerance, and job recovery
- Deploy optimized inference systems with a focus on high throughput and low latency
- Contribute to prototyping next-gen applications in multimodal generative AI
- On-site in Palo Alto, CA
Ideal Candidate
- Experience working with large-scale ML systems or high-performance computing
- Strong Python and PyTorch engineering background; deep understanding of training pipelines
- Proficient in distributed frameworks (DDP, FSDP, tensor/model parallelism)
- Expertise in GPU/CPU performance profiling (e.g., Nsight), CUDA and Triton optimization, and custom kernel development
- Strong generalist software engineering skills (e.g., C++, debugging, systems design)
- Bonus: experience with generative models (Transformers, Diffusion, GANs) and fast prototyping tools (Gradio, Docker)
