Hybrid MLOps Engineer

Posted 6 days ago

About the role

  • Own the end-to-end ML pipeline for computer vision: data prep, training, evaluation, model packaging, artifact/version management, deployment, and monitoring (local GPU cluster + GCP).
  • Design and maintain containerized workflows for multi-GPU training and distributed workloads (e.g., PyTorch DDP, Ray, or similar).
  • Build and operate orchestration (e.g., Airflow/Argo/Kubeflow/Ray Jobs) for scheduled and on-demand pipelines across on-prem and cloud (see the pipeline sketch after this list).
  • Implement and tune resource allocation strategies based on current and upcoming task queues (GPU/CPU/memory-aware scheduling; preemption/priority; autoscaling).
  • Introduce and integrate monitoring/telemetry for:
      • job health and failure analysis (retry, backoff, alerts),
      • data/feature drift and model performance (precision/recall, latency, throughput),
      • infra metrics (GPU utilization, memory, I/O, cost).
  • Harden GCP environments (permissions, networks, registries, storage) and optimize for reliability, performance, and cost (spot/managed instance groups, autoscaling).
  • Establish model governance: experiment tracking, model registry, promotion gates, rollbacks, and audit trails.
  • Standardize CI/CD for ML (data/feature pipelines, model builds, tests, and canary/blue-green rollouts).
  • Collaborate with CV researchers/engineers to productionize new models and improve training throughput & inference SLAs.
  • Continuously improve documentation: update existing pipeline docs and produce concise runbooks, diagrams, and “how-to” guides.
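
For candidates who want a concrete picture of the orchestration and job-health duties above, here is a minimal sketch of such a pipeline as an Airflow DAG (assuming Airflow 2.4+; the DAG id, `pipeline.*` modules, schedule, and retry/alerting settings are illustrative assumptions, not details of our stack):

```python
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash import BashOperator

# Default retry/backoff and alerting behaviour applied to every task.
# email_on_failure assumes SMTP/alerting is already configured in Airflow.
default_args = {
    "owner": "mlops",
    "retries": 3,
    "retry_delay": timedelta(minutes=5),
    "retry_exponential_backoff": True,
    "email_on_failure": True,
}

with DAG(
    dag_id="cv_training_pipeline",     # hypothetical DAG name
    start_date=datetime(2024, 1, 1),
    schedule="@daily",                 # scheduled; can also be triggered on demand
    catchup=False,
    default_args=default_args,
) as dag:
    # Each step shells out to a hypothetical `pipeline` package; in practice these
    # commands would typically run inside the project's training container image.
    prepare_data = BashOperator(
        task_id="prepare_data",
        bash_command="python -m pipeline.prepare_data",
    )
    train = BashOperator(
        task_id="train",
        bash_command="python -m pipeline.train --epochs 50",
    )
    evaluate = BashOperator(
        task_id="evaluate",
        bash_command="python -m pipeline.evaluate",
    )
    package_model = BashOperator(
        task_id="package_model",
        bash_command="python -m pipeline.package",
    )

    # Linear dependency chain: prep -> train -> evaluate -> package.
    prepare_data >> train >> evaluate >> package_model
```

An equivalent pipeline could be expressed with Argo Workflows, Kubeflow, Ray Jobs, or Prefect; the essentials are per-task retries with exponential backoff, failure alerting, and an explicit prepare → train → evaluate → package dependency chain.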

Requirements

  • Hands-on MLOps experience building and running ML pipelines at scale (preferably computer vision) across on-prem GPUs and a public cloud (GCP preferred).
  • Strong with Docker and Docker Compose in local and cloud environments; solid understanding of image build optimization and artifact caching.
  • GitLab CI/CD expertise (modular templates, YAML optimization, build/test stages for ML, environment promotion).
  • Proficiency with Python and Bash for pipeline tooling, glue code, and automation; Terraform for infra-as-code (GCP resources, IAM, networking, storage).
  • Experience with orchestration: one or more of Airflow, Argo Workflows, Kubeflow, Ray, or Prefect.
  • Experience operating GPU workloads: NVIDIA driver/CUDA stack, container runtimes, device plugins (k8s), multi-GPU training, utilization tuning.
  • Observability & monitoring for ML and infra: Prometheus/Grafana, OpenTelemetry/Loki (or similar) for metrics, logs, traces; alerting and SLOs.
  • Experiment tracking / model registry with tools like MLflow or Weights & Biases (runs, params, artifacts, metrics, registry/promotion); a minimal sketch follows this list.
  • Data versioning & validation: DVC/lakeFS (or similar), Great Expectations/whylogs, schema checks, drift detection.
  • Cloud services: GCP (Compute Engine, GKE Standard or Autopilot, Cloud Run, Artifact Registry, Cloud Storage, Pub/Sub). Equivalent AWS/Azure experience is acceptable.
  • Security & compliance for ML stacks: secrets management, SBOM/image scanning, least-privilege IAM, network policies, artifact signing.
  • Solid understanding of containerized deployment patterns (blue-green/canary), rollout strategies, and rollback safety.
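
As a concrete reference point for the experiment-tracking and model-registry requirement above, here is a minimal MLflow sketch (assuming MLflow ≥ 2.3 for registry aliases; the tracking URI, experiment and model names, metric values, and the stand-in `torch.nn.Linear` model are illustrative assumptions only):

```python
import mlflow
import mlflow.pytorch
import torch
from mlflow.tracking import MlflowClient

mlflow.set_tracking_uri("http://mlflow.internal:5000")   # illustrative URI
mlflow.set_experiment("cv-detector")                     # illustrative experiment name

model = torch.nn.Linear(4, 2)   # stand-in for the real computer-vision model

with mlflow.start_run(run_name="baseline") as run:
    # Params, metrics, and the model artifact are tracked on the run.
    mlflow.log_param("backbone", "resnet50")
    mlflow.log_param("epochs", 50)
    mlflow.log_metric("precision", 0.91)
    mlflow.log_metric("recall", 0.88)
    mlflow.pytorch.log_model(model, artifact_path="model")

# Register the run's model in the registry and mark it for the staging gate.
version = mlflow.register_model(f"runs:/{run.info.run_id}/model", "cv-detector")
MlflowClient().set_registered_model_alias("cv-detector", "staging", version.version)
```

Promotion gates and rollbacks then operate on registry versions and aliases: serving resolves an alias such as `staging` or `production`, so a rollback amounts to re-pointing that alias at a previously validated version.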

Benefits

  • Salary from **2,500 EUR to 5,500 EUR per month** (before taxes)
  • A birthday gift
  • **After the probationary period:**
      • **Health Insurance**
      • **Health Recovery Days** (which can be taken as needed)
      • Paid **Study Leave**
  • Funding for the purchase of **vision glasses** after one (1) year of service

Job title

MLOps Engineer

Experience level

Mid level, Senior

Salary

€2,500 - €5,500 per month

Degree requirement

Bachelor's Degree
