Senior HPC Operations Engineer

Lambda

San Francisco, CA

Full-time

10 days ago

HPC, AI, Cluster Management, Linux, Deep Learning

Job Description

Lambda, founded in 2012, is an AI company built by AI engineers, aiming to be the world's top AI computing platform.

Deploy and configure large-scale HPC clusters remotely (up to thousands of nodes)
Install and configure OS, firmware, software, and networking
Troubleshoot and resolve HPC issues on-site in collaboration with physical teams
Specify requirements for engineering teams to improve stability, simplicity, and efficiency
Create and maintain Standard Operating Procedures
Provide updates to project leads
Mentor junior engineers
Stay current with HPC/AI technologies

Expert in HPC engineering, especially logical provisioning
Strong understanding of HPC/AI architecture, OS, firmware, software, and networking
10+ years in HPC cluster deployment for AI
Detail-oriented
Experience with Bright Cluster Manager or similar
Proficient in:
- SFP+ fiber, InfiniBand (IB), 100 GbE networks
- Ethernet, switching, power infrastructure, GPUDirect, RDMA, NCCL, Horovod
- Linux nodes, firmware, drivers
- Job schedulers like SLURM or Kubernetes
Capable of working under deadlines and structured plans
Excellent troubleshooting skills
Willing to travel to data centers
Able to work independently and mentor others

Lambda is committed to diversity and equal employment opportunities.

Position: HPC Engineer for Large-Scale AI Clusters
Experience: 10+ years in HPC deployment and configuration
Expertise: HPC/AI architecture, troubleshooting, Linux systems, network fabrics
Tools & Technologies: Bright Cluster Manager, SLURM, Kubernetes, InfiniBand, 100 GbE, Docker, PyTorch, TensorFlow
Location: San Francisco (4 days/week in office)
Salary Range: $171,000 - $246,000
Benefits: Competitive salary, health coverage, 401k, flexible PTO
Travel: Occasional travel to North American data centers
Core Skills: Detailed troubleshooting, team mentorship, project management
Nice to Have: Experience with ML frameworks, container technologies, cloud infrastructure

Join Lambda to build the world's best deep learning cloud platform.