Senior ML Platform Engineer II (Hybrid)

Posted last month

About the role

  • As a Senior ML Platform Engineer at Mistplay, you will research and develop machine learning solutions and collaborate with teams to solve complex business problems and enhance the mobile gaming experience.

Responsibilities

  • Design, build, and operate standardized training-to-deployment pipelines with Airflow, covering artifact management, environment provisioning, packaging, deployment, and rollback for SageMaker endpoints.
  • Own real-time and batch inference on SageMaker: multi-model endpoints (MME), serverless inference where appropriate, blue/green and canary deployment strategies, autoscaling policies, and cost controls (spot strategies, instance sizing).
  • Implement very low-latency serving patterns using Redis/Valkey: feature caching, online feature retrieval, request-level state, model response caching, and rate limiting/backpressure for bursty traffic.
  • Provision and manage ML/data infrastructure with Terraform: SageMaker endpoints/configurations, ECR/ECS/EKS resources, network endpoints/VPCs, ElastiCache/Valkey clusters, observability stacks, secrets, and IAM.
  • Build platform abstractions and golden paths: Airflow DAG templates, CLI/SDK, cookie-cutter repositories, and CI/CD pipelines that move models from notebooks to production predictably.
  • Establish and manage model lifecycle governance: model/feature registries, approval workflows, promotion policies, lineage and audit trails integrated with Airflow runs and Terraform state.
  • Implement end-to-end observability: data/feature freshness checks, drift/quality controls, model performance/latency SLOs, infrastructure health dashboards, tracing and alerts, plus incident response and postmortems.
  • Collaborate with security, SRE, and data engineering teams on private networks, policy-as-code, handling of PII, least-privilege IAM, and cost-effective architectures across environments.
  • Evaluate, integrate, and rationalize platform tooling (e.g., MLflow registry, feature stores, service gateways); lead migrations with clear change management and minimal downtime.

Requirements

  • 5+ years of experience building and operating production-grade ML/data platforms focused on serving, reliability, and developer experience.
  • Strong software engineering skills in Python, Go, or Java; experience building resilient services, APIs, and automation tools with high test coverage.
  • Deep experience with AWS SageMaker inference: endpoint configuration, containerization, model packaging, autoscaling, trade-offs between serverless and real-time, MME, A/B and canary releases.
  • Expertise with Redis/Valkey as an online feature store in ML serving contexts.
  • Proven Terraform experience for end-to-end ML and data infrastructure management: modules, workspaces, drift detection, change review, and safe rollbacks; familiarity with GitOps patterns.
  • Large-scale Airflow orchestration: dependency modeling, sensors, retries, SLAs, backfills, DAG factories, and integrations with registries, artifact stores, and Terraform pipelines.
  • Familiarity with ML frameworks (scikit-learn, XGBoost, PyTorch, TensorFlow) from a platform integration perspective to support diverse runtimes and containers.
  • Observability for ML workflows: metrics/logs/traces, performance profiling, capacity planning, cost monitoring, and runbooks.
  • Excellent cross-functional communication and collaboration with data science, data engineering, DevOps, and backend teams.

Benefits

  • Team lunches
  • Game nights
  • Company-wide events

Job title

Senior ML Platform Engineer II

Experience level

Senior

Salary

Not specified

Degree requirement

Bachelor's Degree
