Job Description
Lambda, founded in 2012, is an AI company built by AI engineers, aiming to be the world's top AI computing platform.
About Lambda
- Founded: 2012
- Employees: ~350 (2024)
- Mission: Make computation as effortless and ubiquitous as electricity
- Product: AI Cloud for deployment of AI with hardware to cloud solutions
- Clients: Leading companies and research institutions
- Work Environment: In-office in San Francisco 4 days/week (Tuesday WFH)
What You’ll Do
- Deploy and configure large-scale HPC clusters remotely (up to thousands of nodes)
- Install and configure OS, firmware, software, and networking
- Troubleshoot and resolve HPC issues on-site in collaboration with physical teams
- Specify requirements for engineering teams to improve stability, simplification, and efficiency
- Create and maintain Standard Operating Procedures
- Provide updates to project leads
- Mentor junior engineers
- Stay current with HPC/AI technologies
You
- Expert in HPC engineering, especially logical provisioning
- Strong understanding of HPC/AI architecture, OS, firmware, software, and networking
- 10+ years in HPC cluster deployment for AI
- Detail-oriented
- Experience with Bright Cluster Manager or similar
- Proficient in:
- SFP+ fiber, Infiniband (IB), 100 GbE networks
- Ethernet, switching, power infrastructure, GPU direct, RDMA, NCCL, Horovod
- Linux nodes, firmware, drivers
- Job schedulers like SLURM or Kubernetes
- Capable of working under deadlines and structured plans
- Excellent troubleshooting skills
- Willing to travel to data centers
- Able to work independently and mentoring
Nice to Have
- ML/DL frameworks (PyTorch, TensorFlow)
- Benchmarking tools (DeepSpeed, MLPerf)
- Containerization (Docker, Kubernetes)
- Cloud tech (GPU acceleration, virtualization)
- Diplomatic customer engagement
- Degree in EE, CS, Physics, Mathematics or equivalent experience
Compensation & Benefits
- Salary Range: $171,000 - $246,000 annually
- Generous cash and equity options
- Health, dental, vision for dependents
- Work-from-home stipends
- 401k plan (US)
- Flexible PTO
Equal Opportunity
Lambda is committed to diversity and equal employment opportunities.
Job Highlights
Responsibilities
- Deploy and configure large-scale HPC clusters remotely
- Install and configure OS, firmware, software, and network components
- Troubleshoot HPC clusters
- Define and communicate requirements
- Create and maintain SOPs
- Update project leads
- Mentor team members
- Keep current with new HPC/AI technologies
Qualifications
- Deep HPC experience, particularly in logical provisioning
- Strong knowledge in HPC/AI architecture, systems, and networking
- 10+ years of HPC deployment experience
- Attention to detail
- Experience with cluster management tools
- Troubleshooting expertise in network fabrics, Ethernet, GPU environments
- Linux expertise
- Experience with job schedulers
- Flexible and team-oriented
- Mentoring skills
- Experience with ML/DL frameworks and containers is a plus
- Bachelor's degree or equivalent experience
Benefits
- Salary: $171K - $246K
- Competitive compensation and equity
- Comprehensive health benefits
- Work stipends
- Retirement plan
- PTO
Tags
HPC, AI, Cluster Management, Linux, Cloud Computing