Job Title: MLOps and Data Engineer
Company Overview
We are a rapidly growing, well-funded startup with 25 employees, specializing in cutting-edge generative video technology.
Position
Senior MLOps & Data Engineer — Join our expanding team to help drive innovation.
Responsibilities
- Own and enhance ML data and training pipelines, from data ingestion to deployment, ensuring scalability, efficiency, and reliability
- Build, deploy, and manage distributed training jobs across multi-node clusters using frameworks like Megatron
- Design and oversee robust data processing workflows that support both real-time and batch operations
- Collaborate with research and engineering teams to transition ML models from prototype to production
- Optimize inference pipelines for low latency and cost-effectiveness
- Monitor and improve system performance, reliability, and security throughout the ML lifecycle
Key Qualifications
- 4+ years in MLOps, ML infrastructure, or data engineering
- Practical experience with distributed training on GPU clusters
- Deep knowledge of MLOps best practices and tools: CI/CD, experiment tracking, versioning
- Experience with cloud platforms: AWS or GCP
- Programming skills: Python and associated ML tools (PyTorch, TensorFlow, Airflow, Kubeflow, etc.)
- Expertise in inference optimization for large models, focusing on latency, throughput, and cost
- Strong system design capabilities: distributed systems, orchestration, monitoring
- Bonus: experience with Megatron, TPU/GPU optimization, or large-model training frameworks
Application
Please apply for more details.
Job Highlights
Qualifications
- 4+ years in relevant roles
- Hands-on experience with GPU cluster training
- Deep understanding of MLOps practices and tools
- Experience with AWS or GCP
- Proficiency in Python and ML tools
- Experience with large model inference optimization
- Strong distributed systems and monitoring skills
- Bonus: Megatron, TPU/GPU training frameworks
Responsibilities
- Own and improve ML data and training pipelines
- Build and manage distributed training jobs
- Design data processing workflows
- Collaborate to productionize ML models
- Optimize inference pipelines
- Continuously improve system performance and security