Own the end-to-end ML pipeline for computer vision: data prep, training, evaluation, model packaging, artifact/version management, deployment, and monitoring (local GPU cluster + GCP).
Design and maintain containerized workflows for multi-GPU training and distributed workloads (e.g., PyTorch DDP, Ray, or similar).
Build and operate orchestration (e.g., Airflow/Argo/Kubeflow/Ray Jobs) for scheduled and on-demand pipelines across on-prem and cloud.
Implement and tune resource allocation strategies based on current and upcoming task queues (GPU/CPU/memory-aware scheduling; preemption/priority; autoscaling).
Introduce and integrate monitoring/telemetry for job health and failure analysis (retry, backoff, alerts); data/feature drift and model performance (precision/recall, latency, throughput); and infra metrics (GPU utilization, memory, I/O, cost).
Harden GCP environments (permissions, networks, registries, storage) and optimize for reliability, performance, and cost (spot/managed instance groups, autoscaling).
Establish model governance: experiment tracking, model registry, promotion gates, rollbacks, and audit trails.
Standardize CI/CD for ML (data/feature pipelines, model builds, tests, and canary/blue-green rollouts).
Collaborate with CV researchers/engineers to productionize new models and improve training throughput & inference SLAs.
Continuously improve documentation: update existing pipeline docs and produce concise runbooks, diagrams, and “how-to” guides.
Requirements
Hands-on MLOps experience building and running ML pipelines at scale (preferably computer vision) across on-prem GPUs and a public cloud (GCP preferred).
Strong skills in Docker and Docker Compose in local and cloud environments; solid understanding of image build optimization and artifact caching.
Proficiency with Python and Bash for pipeline tooling, glue code, and automation; Terraform for infra-as-code (GCP resources, IAM, networking, storage).
Experience with orchestration: one or more of Airflow, Argo Workflows, Kubeflow, Ray, or Prefect.