Hybrid Machine Learning Operations Engineer

Posted last month

About the role

  • As an MLOps Engineer at Nuvei, you will develop platforms for machine learning and generative AI products, collaborating with Data Scientists and Engineers to ensure the reliability and governance of models moved to production.

Responsibilities

  • Operate and develop ML/LLM platforms on Kubernetes and cloud (Azure; AWS/GCP also acceptable) with Docker, Terraform, and other relevant tools
  • Manage object storage, GPUs, and autoscaling for training & low-latency model serving
  • Manage cloud environment, networking, service mesh, secrets, and policies to meet PCI-DSS and data-residency requirements
  • Build end-to-end CI/CD for models/agents/MCP tooling (versioning, tests, approvals)
  • Deliver real-time fraud/risk scoring and agent signals under strict latency SLOs
  • Maintain MCP servers/clients: tool/resource definitions, versioning, quotas, isolation, access controls
  • Integrate agents with microservices, event streams, and rule engines; provide SLAs, tracing, and on-call runbooks
  • Measure operational metrics for ML/LLM workloads (latency, throughput, cost, tokens, tool success, safety events)
  • Enforce governance: RBAC/ABAC, row-level security, encryption, PII/secrets management, audit trails
  • Partner with DS on packaging (wheels/conda/containers), feature contracts, and reproducible experiments
  • Lead incident response and post-mortems
  • Drive FinOps: right-sizing, GPU utilization, batching/caching, budget alerts

Requirements

  • 4+ years in DevOps/MLOps/Platform roles building and operating production ML systems (batch and real-time)
  • Strong hands-on experience with Kubernetes, Docker, Terraform/IaC, and CI/CD
  • Practical experience with Spark/Databricks and scalable data processing
  • Proficiency in Python & Bash
  • Ability to operate DS code and optimize runtime performance
  • Experience with model registries (MLflow or similar), experiment tracking, and artifact management
  • Production model serving using FastAPI/Ray Serve/Triton/TorchServe, including autoscaling and rollout strategies
  • Monitoring and tracing with Prometheus/Grafana/OpenTelemetry; alerting tied to SLOs/SLAs
  • Solid understanding of PCI-DSS/GDPR considerations for data and ML systems
  • Experience with the Azure cloud environment is a big plus
  • Operating LLM/agent workloads in production (prompt/config versioning, tool execution reliability, fallback/retry policies)
  • Building/maintaining RAG stacks (indexing pipelines, vector DBs, retrieval evaluation, hybrid search)
  • Implementing guardrails (policy checks, content filters, allow/deny lists) and human-in-the-loop workflows
  • Experience with feature stores (Qwak Feature Store, Feast)
  • A/B testing for models and agents, offline/online evaluation frameworks
  • Payments/fraud/risk domain experience, including integrating ML outputs with rule engines and operational systems, is an advantage
  • Familiarity with Databricks Unity Catalog, dbt, or similar tooling

Benefits

  • Private Medical Insurance
  • Office and home hybrid working
  • Global bonus plan
  • Volunteering programs
  • Prime location office close to Tel Aviv train station

Job title

Machine Learning Operations Engineer

Job type

Experience level

Mid level, Senior

Salary

Not specified

Degree requirement

Bachelor's Degree

Location requirements
