NVIDIA DGX Cloud AI Infrastructure Software Engineer
Join NVIDIA's DGX Cloud AI Efficiency Team and contribute to the robust infrastructure that supports cutting-edge AI research. Our team focuses on optimizing the efficiency and resiliency of AI workloads, developing scalable AI and data infrastructure tools, and delivering a stable environment for AI innovators.
Role Overview
As an AI Infrastructure Software Engineer, you will design, build, and maintain the AI infrastructure that enables large-scale training and inference. You will apply software and systems engineering practices to ensure high efficiency and availability.
What You’ll Be Doing
- Develop infrastructure software and tools for large-scale AI, LLM, and GenAI.
- Improve infrastructure efficiency and resiliency through tool development.
- Troubleshoot and perform root cause analysis on failures ranging from the application layer to hardware.
- Enhance AI platform infrastructure and products.
- Co-design APIs for NVIDIA's resiliency stacks.
- Define and track system reliability metrics.
- Solve problems, perform root cause analysis, and optimize systems.
Requirements
- 8+ years of experience in AI infrastructure software development.
- Bachelor’s degree or higher in Computer Science or related field.
- Strong debugging and triage skills.
- Proven experience with large-scale distributed systems.
- Experience with AI training, inference, and data infrastructure.
- Familiarity with large-scale observability platforms (ELK, Prometheus, Loki).
- Proficiency in Python, C/C++, and scripting languages.
- Excellent communication and collaboration skills.
Preferred Qualifications
- Experience with large-scale AI clusters.
- Understanding of NVIDIA GPUs and networking technologies (RDMA, InfiniBand, NCCL).
- Familiarity with deep learning frameworks (PyTorch, TensorFlow, JAX, Ray).
- Experience in failure analysis at datacenter scale.
- Strong software design and development skills.
About NVIDIA
NVIDIA leads in AI, High-Performance Computing, and Visualization, powering innovations from autonomous vehicles to advanced AI research.
Compensation
- Salary: $184,000 - $356,500 (based on location, experience, and market standards).
- Eligible for equity and benefits.
Commitment to Diversity
NVIDIA is an equal opportunity employer committed to diversity and inclusion.
Job Reference: JR1997415