Staff Engineer, CI/CD & Cloud Infrastructure

$175k – $185k/yr San Diego, US on-site full time senior 24d ago

Skills

airflow aws cmake cuda docker elasticsearch grafana helm kubernetes loki make opensearch pip prometheus python setuptools terraform

About this role

Staff Engineer, CI/CD & Cloud Infrastructure

Location: San Diego, CA
Job Type: Full-Time
Salary Range: $175,000 - $185,000

Position Overview

We are looking for a Staff CI/CD & Cloud Infrastructure Engineer to own and evolve our build pipelines, deployment workflows, and cloud infrastructure. You will be responsible for ensuring that software (spanning Python, C/C++, and CUDA on Linux) is built, tested, versioned, and deployed reliably across both AWS cloud environments and a fleet of complex embedded instruments operated in our central lab facility.

This is a senior hands-on role for an engineer who thrives at the intersection of DevOps automation, cloud infrastructure management, and release engineering. You will design and maintain CI/CD pipelines, manage complex AWS infrastructure as code, and ensure full traceability from source commits through builds, tests, artifacts, and deployments. You will work cross-functionally with firmware, application, and HPC engineers to keep the entire delivery pipeline fast, reliable, and observable.

Key Responsibilities

CI/CD & Build Engineering
- Design, build, and maintain CI/CD pipelines using GitHub Actions or similar platforms
- Manage build systems for Python, C/C++, and CUDA codebases on Linux
- Integrate build tools (CMake, Make, pip, setuptools) into automated pipelines
- Implement robust versioning, tagging, and artifact management strategies
- Ensure full traceability of builds, test results, and artifacts from commit to deployment
- Manage Docker-based build environments including base images, caching, and reproducibility
- Maintain and optimize build performance, parallelism, and reliability

Cloud Infrastructure (AWS)
- Architect and manage complex AWS infrastructure including:
  - IAM roles, policies, and access management
  - Storage services (S3, EBS, EFS) with tiered lifecycle policies
  - Databases (RDS, DynamoDB, or similar) with backup and failover strategies
  - Data workflow and pipeline engines (Step Functions, Airflow, or similar)
  - Compute services (EC2, ECS, EKS, Lambda) scaled to workload requirements
- Implement infrastructure as code using Terraform
- Manage Kubernetes clusters and Helm charts for containerized workloads
- Design for scalability, high availability, and disaster recovery
- Manage cost optimization, resource tagging, and infrastructure governance
- Support multi-account and multi-region strategies as needed
- Familiarity with Azure and GCP for secondary or hybrid requirements

On-Premises HPC & Hybrid Infrastructure
- Provision, configure, and manage on-premises Linux HPC nodes used for secondary and tertiary data processing
- Define infrastructure-as-code (Terraform, Ansible, or similar) for reproducible HPC node provisioning and configuration
- Manage high-speed networking infrastructure between instruments, HPC nodes, and storage (configuration, monitoring, troubleshooting)
- Implement and manage shared storage systems (NFS, parallel filesystems, or similar) accessible to both local HPC and cloud compute
- Design and operate hybrid burst-to-cloud infrastructure: provision and manage AWS compute resources that extend local HPC capacity on demand
- Collaborate with the data pipeline team to ensure infrastructure meets throughput, latency, and reliability requirements
- Manage OS patching, driver updates, and GPU runtime environments across HPC nodes
- Monitor HPC cluster health, utilization, and capacity to inform scaling decisions

Experiment Data Management & Pipelines
- Design and operate data ingestion pipelines for high-volume experiment data from lab instruments
- Implement tiered storage strategies (hot/warm/cold) to balance accessibility, performance, and cost
- Deploy and manage search infrastructure (Elasticsearch/OpenSearch) to make experiment data universally discoverable and queryable
- Build data cataloging and metadata tagging systems so datasets are well-organized and self-describing
- Integrate visualization tools (Grafana, Kibana, or similar) to enable engineers and scientists to explore and analyze experiment data
- Design data lifecycle policies including retention, archival, and compliance requirements
- Ensure data pipelines are reliable, idempotent, and observable with clear error handling and retry logic
- Work with engineering and science teams to define data schemas, access patterns, and query requirements

Deployment & Release Engineering
- Own deployment workflows for software delivered to embedded instruments in our central lab
- Manage release processes for a small number of complex, high-value lab-operated instruments
- Design deployment strategies that account for rollback, validation, and minimal downtime
- Coordinate versioned releases across multiple software components and dependencies
- Support development, staging, and production environment parity

Logging, Observability & Traceability
- Implement centralized log collection and aggregation across cloud and on-site systems
- Deploy and manage observability tooling (Prometheus, Grafana, Loki, CloudWatch, or similar)
- Ensure structured, searchable logging with clear correlation across services
- Build dashboards and alerting for infrastructure health, pipeline status, and deployment state
- Establish traceability standards linking builds, tests, artifacts, and deployments
- Support diagnostics and post-mortem analysis for production incidents

AI-Augmented DevOps
- Integrate agentic AI tools into CI/CD workflows to automate code review, test generation, and pipeline troubleshooting
- Evaluate and deploy AI-powered assistants for infrastructure management, incident response, and operational tasks
- Design guardrails and human-in-the-loop controls for AI-driven automation in production environments
- Stay current with the rapidly evolving landscape of AI-augmented development and DevOps tooling
- Champion adoption of agentic AI across engineering workflows to accelerate delivery and improve reliability

Qualifications

Education: BS/MS in Computer Science or Engineering

Required:

Experience & Technical Skills
- 7+ years of experience in DevOps, CI/CD, or cloud infrastructure roles
- Strong, hands-on Linux expertise (administration, debugging, performance tuning)
- Deep experience designing and operating CI/CD pipelines (GitHub Actions preferred)
- Proven experience managing complex AWS infrastructure at scale
- Strong knowledge of Docker including multi-stage builds, registries, and orchestration
- Experience with infrastructure as code using Terraform
- Experience with Kubernetes and Helm for container orchestration
- Solid understanding of versioning strategies, artifact management, and release engineering
- Experience integrating agentic AI into DevOps workflows and CI/CD pipelines

Programming & Build Systems
- Proficiency in Python and shell scripting for automation and tooling
- Ability to read, debug, and build C/C++ and CUDA applications on Linux
- Experience integrating build systems (CMake, Make) into CI pipelines
- Familiarity with package management and dependency resolution across languages

Cloud & Infrastructure
- Deep AWS experience across IAM, networking (VPC, security groups), storage, compute, and database services
- Experience managing on-premises Linux HPC infrastructure alongside cloud resources
- Experience designing for high availability, failover, and disaster recovery
- Experience with data pipeline and workflow orchestration tools (Step Functions, Airflow, or similar)
- Experience with search and indexing platforms (Elasticsearch, OpenSearch, or similar)
- Understanding of tiered storage strategies and data lifecycle management
- Knowledge of cost management, tagging strategies, and infrastructure governance

Observability & Traceability
- Experience with logging and monitoring stacks (Prometheus, Grafana, Loki, ELK, or CloudWatch)
- Understanding of build and artifact traceability practices
- Experience with structured logging and distributed tracing concepts

Preferred:
- Experience deploying software to embedded or lab-operated instruments
- Experience with high-speed networking (InfiniBand, RDMA, or 10/25/100GbE) in HPC environments
- Experience with CUDA build toolchains and GPU-accelerated workloads
- Familiarity with Azure or GCP in addition to AWS
- Experience in regulated or reliability-sensitive environments
- Experience with GitOps workflows and progressive delivery strategies
- Familiarity with secrets management (Vault, AWS Secrets Manager)

We are an equal opportunity employer. We thrive on diversity and collaboration.