Machine Learning Infrastructure Engineer
Palo Alto, US | On-site | Full time | Senior | Jan 26, 2026
THE ROLE
At Mind Robotics, we’re building generalized physical AI—robotic systems capable of dexterous, adaptive, and reasoning-intensive work in real-world industrial environments. Our ability to iterate quickly on large-scale models depends on world-class ML infrastructure.
We’re looking for a Machine Learning Infrastructure Engineer to build the core systems that enable fast, reliable, and scalable model training—powering everything from experimentation to production deployment.
RESPONSIBILITIES
- Design and implement scalable systems for training large ML models
- Enable efficient workflows for data ingestion, training, and iteration
- Develop and optimize distributed training systems across hundreds of GPUs
- Implement strategies for parallelization, sharding, and efficient compute utilization
- Improve training efficiency through techniques such as attention optimizations, kernel fusion, and memory management
- Partner closely with modeling teams to accelerate iteration speed and reduce training costs
- Build internal tools for experiment tracking, monitoring, and debugging
- Implement systems for tracking training performance, failures, and resource utilization
- Debug and resolve bottlenecks across the training stack
- Provide lightweight infrastructure support for deploying and running models in production environments
- Optimize inference performance and reliability where needed
- Support core cloud infrastructure needs for training workloads (without heavy DevOps overhead)
- Manage compute resources efficiently across training jobs
QUALIFICATIONS
- Strong experience building infrastructure for large-scale ML training
- Deep understanding of how modern LLM/VLM systems are trained and scaled
- Proven experience setting up and scaling distributed training across hundreds of GPUs
- Strong understanding of parallelization strategies (data, model, and pipeline parallelism)
- Strong proficiency in Python
- Expert-level proficiency in PyTorch and/or JAX
- Strong understanding of techniques like attention optimization, kernel fusion, and efficient memory usage
NICE TO HAVE
- Experience supporting inference systems in production
- Familiarity with robotics or embodied AI workloads
- Experience building tools for experiment management and researcher productivity