Job Description: Machine Learning Engineer (Infrastructure Focus)
Role Overview
This is an opportunity to work as a Machine Learning Engineer specializing in infrastructure. You will design and build the tools, frameworks, and systems that enable efficient training, deployment, and scaling of machine learning models. Your work will address challenges in model optimization, infrastructure automation, and distributed computing in support of high-performance AI/ML workflows.
Key Responsibilities
- Develop and maintain ML infrastructure for distributed model training and inference
- Implement tools for model versioning, experiment tracking, and automated deployments
- Optimize ML pipelines to enhance training and inference efficiency at scale
- Collaborate with data scientists and engineers to integrate ML workflows with existing systems
- Monitor and ensure the reliability, security, and performance of the ML infrastructure
Requirements
- Experience with ML frameworks (e.g., TensorFlow, PyTorch, JAX)
- Knowledge of MLOps tools (e.g., MLflow, Kubeflow, Airflow)
- Proficiency in containerization and orchestration tools (e.g., Docker, Kubernetes)
- Strong programming skills in Python and familiarity with CI/CD pipelines
- Understanding of distributed training methods and hardware acceleration (GPUs, TPUs)
- Experience working with large language models and models exceeding 10B parameters
Job Highlights
This role offers the opportunity to work at the forefront of ML infrastructure development, building scalable and efficient machine learning systems and collaborating across teams to improve AI/ML workflows.