Machine Learning Infrastructure Engineer

Palo Alto, US | On-site | Full time | Senior | Jan 26, 2026

About this role

THE ROLE

At Mind Robotics, we’re building generalized physical AI: robotic systems capable of dexterous, adaptive, and reasoning-intensive work in real-world industrial environments. Our ability to iterate quickly on large-scale models depends on world-class ML infrastructure. We’re looking for a Machine Learning Infrastructure Engineer to build the core systems that enable fast, reliable, and scalable model training, powering everything from experimentation to production deployment.

RESPONSIBILITIES

- Design and implement scalable systems for training large ML models
- Enable efficient workflows for data ingestion, training, and iteration
- Develop and optimize distributed training systems across hundreds of GPUs
- Implement strategies for parallelization, sharding, and efficient compute utilization
- Improve training efficiency through techniques such as attention optimizations, kernel fusion, and memory management
- Partner closely with modeling teams to accelerate iteration speed and reduce training costs
- Build internal tools for experiment tracking, monitoring, and debugging
- Implement systems for tracking training performance, failures, and resource utilization
- Debug and resolve bottlenecks across the training stack
- Provide lightweight infrastructure support for deploying and running models in production environments
- Optimize inference performance and reliability where needed
- Support core cloud infrastructure needs for training workloads (without heavy DevOps overhead)
- Manage compute resources efficiently across training jobs

QUALIFICATIONS

- Strong experience building infrastructure for large-scale ML training
- Deep understanding of how modern LLM/VLM systems are trained and scaled
- Proven experience setting up and scaling distributed training across hundreds of GPUs
- Strong understanding of parallelization strategies (data, model, pipeline parallelism)
- Strong proficiency in Python programming
- Expert-level proficiency in PyTorch and/or JAX
- Strong understanding of techniques like attention optimization, kernel fusion, and efficient memory usage

NICE TO HAVE

- Experience supporting inference systems in production
- Familiarity with robotics or embodied AI workloads
- Experience building tools for experiment management and researcher productivity