MLOps Engineer managing AI pipelines for computer vision models. The role centers on streamlining the end-to-end model lifecycle in a hybrid work environment.
Responsibilities
Own the end-to-end ML pipeline for computer vision: data prep, training, evaluation, model packaging, artifact/version management, deployment, and monitoring (local GPU cluster + GCP).
Design and maintain containerized workflows for multi-GPU training and distributed workloads (e.g., PyTorch DDP, Ray, or similar).
Build and operate orchestration (e.g., Airflow/Argo/Kubeflow/Ray Jobs) for scheduled and on-demand pipelines across on-prem and cloud.
Implement and tune resource allocation strategies based on current and upcoming task queues (GPU/CPU/memory-aware scheduling; preemption/priority; autoscaling).
Introduce and integrate monitoring/telemetry for:
job health and failure analysis (retry, backoff, alerts),
data/feature drift and model performance (precision/recall, latency, throughput),
infra metrics (GPU utilization, memory, I/O, cost).
Harden GCP environments (permissions, networks, registries, storage) and optimize for reliability, performance, and cost (spot/managed instance groups, autoscaling).
Establish model governance: experiment tracking, model registry, promotion gates, rollbacks, and audit trails.
Standardize CI/CD for ML (data/feature pipelines, model builds, tests, and canary/blue-green rollouts).
Collaborate with CV researchers/engineers to productionize new models and improve training throughput & inference SLAs.
Continuously improve documentation: update existing pipeline docs and produce concise runbooks, diagrams, and “how-to” guides.
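The resource-allocation responsibility above (GPU/CPU/memory-aware scheduling with priorities) can be sketched in a few lines of stdlib Python. This is a hypothetical illustration, not the team's actual scheduler: `Job`, `schedule`, and the field names are invented for the example, and a real system would layer preemption and autoscaling on top.

```python
import heapq
from dataclasses import dataclass, field

@dataclass(order=True)
class Job:
    # Only sort_key participates in ordering; lower runs first,
    # so we negate priority (higher priority => earlier).
    sort_key: int = field(init=False, repr=False)
    name: str = field(compare=False)
    priority: int = field(compare=False)
    gpus: int = field(compare=False)
    mem_gb: int = field(compare=False)

    def __post_init__(self):
        self.sort_key = -self.priority

def schedule(jobs, free_gpus, free_mem_gb):
    """Greedily admit the highest-priority jobs that fit current free resources."""
    queue = list(jobs)
    heapq.heapify(queue)
    admitted, deferred = [], []
    while queue:
        job = heapq.heappop(queue)
        if job.gpus <= free_gpus and job.mem_gb <= free_mem_gb:
            free_gpus -= job.gpus
            free_mem_gb -= job.mem_gb
            admitted.append(job.name)
        else:
            deferred.append(job.name)  # re-queued on the next scheduling tick
    return admitted, deferred

# Example: a 5-GPU / 96 GB slice of the cluster and three queued jobs.
jobs = [
    Job("train-a", priority=10, gpus=4, mem_gb=64),
    Job("eval-b", priority=5, gpus=1, mem_gb=16),
    Job("train-c", priority=8, gpus=4, mem_gb=64),
]
admitted, deferred = schedule(jobs, free_gpus=5, free_mem_gb=96)
```

Here the top-priority job is admitted first, the next-highest no longer fits, and a smaller low-priority job backfills the remaining capacity, which is the essence of priority-plus-bin-packing scheduling.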
Requirements
Hands-on MLOps experience building and running ML pipelines at scale (preferably computer vision) across on-prem GPUs and a public cloud (GCP preferred).
Strong with Docker and Docker Compose in local and cloud environments; solid understanding of image build optimization and artifact caching.
Proficiency with Python and Bash for pipeline tooling, glue code, and automation; Terraform for infra-as-code (GCP resources, IAM, networking, storage).
Experience with orchestration: one or more of Airflow, Argo Workflows, Kubeflow, Ray, or Prefect.
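The orchestrators listed above ship built-in retry policies, which is the usual way to handle the "retry, backoff, alerts" aspect of job health. For illustration only, the underlying pattern (capped exponential backoff with jitter) looks roughly like this in stdlib Python; `retry` and its parameters are invented names for the sketch:

```python
import random
import time

def retry(fn, *, attempts=5, base_delay=1.0, max_delay=60.0, sleep=time.sleep):
    """Call fn(), retrying on exception with capped exponential backoff and jitter."""
    for attempt in range(attempts):
        try:
            return fn()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of retries: surface the failure so alerting can fire
            delay = min(max_delay, base_delay * 2 ** attempt)
            sleep(delay * random.uniform(0.5, 1.0))  # jitter avoids thundering herds

# Demo: a transiently failing task that succeeds on the third call.
calls = {"n": 0}

def flaky_task():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("transient failure")
    return "ok"

result = retry(flaky_task, sleep=lambda _: None)  # no real sleeping in the demo
```

Injecting `sleep` keeps the demo (and unit tests) fast; production code would use the default `time.sleep` or, more likely, the orchestrator's declarative retry configuration.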
Machine Learning Engineer at Tilt, developing personalisation solutions across various app surfaces. Collaborate with teams to enhance recommendation systems on a video-first shopping platform.
Senior Machine Learning Engineer architecting next-generation AI platforms for healthcare and fintech with Nitra's diverse team. Focused on data pipelines, ML infrastructure, and production-ready AI systems.
Senior Machine Learning Engineer architecting and building Nitra's data and AI platform. Driving intelligent products across healthcare and fintech industries with applied AI and platform engineering.
Machine Learning Engineer developing and implementing ML models for lending at Blue Whale Lending LLC. Collaborating with teams to enhance data insights and validate model performance.
Applied ML Engineer contributing to machine learning and perception tasks for edge-intelligent maritime systems. Collaborating with cross-functional teams to deliver real-world AI solutions.
AI/ML Engineer building data science and AI solutions for Pharma and MedTech clients on Azure. Collaborating with teams to deliver end-to-end machine learning projects.
ML Engineering Lead at Saris AI tackling multi-modal AI systems in banking. Drive technical direction and build high-performing teams in an early-stage startup environment.
Machine Learning Engineer designing and training lightweight ASR models for mobile devices at Plaud. Contributing to optimization, multilingual data management, and deployment collaboration.
Machine Learning Engineer designing post-processing test suites for AI interaction systems at Plaud Inc. Collaborating on speech algorithm training and user experience optimization in San Francisco.