Senior Platform Engineer – ML Infrastructure
We’re hiring a Senior Platform Engineer specializing in ML Infrastructure within the AI/ML and Deep-Tech industry.
About the Role
Join our core infrastructure team to design and scale foundational systems powering AI products. We seek passionate engineers committed to building robust, efficient, and innovative ML infrastructure.
Job Requirements
- DevOps Experience: Docker, Kubernetes, shell scripting
- Distributed GPU & ML Infrastructure: Working with distributed GPU systems; training and serving ML models at scale
- Tools & Systems: Data annotation, data curation, model registry, model serving, workflow orchestration
- ML Systems Expertise: Building LLM-powered systems like RAG and agentic workflows
- Troubleshooting Skills: Diagnosing, resolving, and preventing CUDA errors
- Kubernetes: Multi-node setup, managed services, kube-native tooling
- Configuration Management & Infrastructure as Code: Ansible, Terraform, Pulumi
- Data Pipelines: Setting up data lakes and processing pipelines
- High Availability: Setting up highly available services and databases
- Cloud Platforms: AWS, Azure, GCP
- Networking & Security: Load balancing, DNS, proxies, RBAC
- CI/CD: Building deployment pipelines
- ML Orchestration: Kubeflow, Flyte, Prefect
- Distributed Computing: Ray
- ML & Data Versioning: Model/data orchestration and versioning
- Distributed Application Design: Understanding of modern distributed systems
Job Responsibilities
- Own infrastructure for on-premise and cloud deployments
- Collaborate with development teams to build scalable applications
- Establish and promote best practices
- Manage reporting to senior leadership
- Foster a collaborative, innovative, inclusive environment
- Mentor team members on a platform-first mindset
Preferred Attributes
Passion for impactful technology and a shared vision. We encourage you to apply even if you do not meet every requirement.
Skills Summary
ML orchestration, workflow orchestration, data pipelines, CUDA troubleshooting, networking, distributed GPU systems, Kubernetes, cloud services, Docker, model registry, MLOps, DevOps, data curation, high availability, RBAC (role-based access control), and more.