Lead ML Ops Engineer for a fast-growing AI startup focused on scalable infrastructure. Drive hands-on execution across the entire model lifecycle in a collaborative environment.
Responsibilities
Architect, build, and scale the end-to-end ML Ops pipeline, including training, fine-tuning, evaluation, rollout, and monitoring.
Design reliable infrastructure for model deployment, versioning, reproducibility, and orchestration across cloud and on-prem GPU clusters.
Optimize compute usage across distributed systems (Kubernetes, autoscaling, caching, GPU allocation, checkpointing workflows).
Lead the implementation of observability for ML systems (monitor drift, performance, throughput, reliability, cost).
Build automated workflows for dataset curation, labeling, feature pipelines, evaluation, and CI/CD for ML models.
Collaborate with researchers to productionize models and accelerate training/inference pipelines.
Establish ML Ops best practices, internal standards, and cross-team tooling.
Mentor engineers and influence architectural direction across the entire AI platform.
Requirements
Deep hands-on experience designing and operating production ML systems at scale (Staff/Principal-level expected).
Strong background in ML Ops, distributed systems, and cloud infrastructure (AWS, GCP, or Azure).
Proficiency with Python and familiarity with TypeScript or Go for platform integration.
Expertise in ML frameworks (PyTorch, Transformers, vLLM, LLaMA-Factory, Megatron-LM) and a practical understanding of CUDA / GPU acceleration.
Strong experience with containerization and orchestration (Docker, Kubernetes, Helm, autoscaling).
Deep understanding of ML lifecycle workflows: training, fine-tuning, evaluation, inference, model registries.
Ability to lead technical strategy, collaborate cross-functionally, and operate in fast-paced environments.
Lead Machine Learning Engineer creating personalized item recommendations for Target.com and the Target App. Designing and optimizing production ML solutions with a team of data scientists and engineers.
Senior Machine Learning Engineer at Doctrine focusing on developing NLP models for legal document processing. Join an ambitious team to innovate within the field of legal technology.
Senior ML Engineer developing scalable production ML systems across various teams at JobCloud. Leading innovation in the AI-driven recruitment landscape, improving job ad visibility and performance.
MLOps Engineer responsible for designing and maintaining ML pipelines at JobCloud. Collaborating with teams to productionize ML models and ensuring robust system performance.
Senior Machine Learning Engineer at greehill developing ML solutions for sustainable urban living. Leading projects in Computer Vision and Deep Learning to transform urban environments.
Machine Learning Engineer developing deep learning models for self-driving vehicle systems at BlueSpace.ai. Engage in innovative ML applications while collaborating with a seasoned team in the autonomous vehicle ecosystem.
Machine Learning Engineer developing perception foundation models that leverage multimodal sensor data for autonomous vehicles at Woven by Toyota. Design and implement machine learning solutions that influence Toyota production vehicles.
Senior ML Engineer developing and scaling multitemporal, multimodal models for Earth observation using satellite imagery at LiveEO. The role involves applied research and engineering with real-world impact.
Senior Machine Learning Engineer developing deep learning models for radiological imaging applications. Focusing on designing and validating ML solutions to support accurate medical diagnoses with interdisciplinary teams.
Senior Machine Learning Engineer involved in speech-to-text projects at Level AI. Designing ASR systems and collaborating in a fast-paced AI environment.