Senior ML Platform Engineer at Mistplay researching and developing machine learning solutions. Collaborating with teams to solve complex business problems and enhance the mobile gaming experience.
Responsibilities
Design, build, and operate standardized training-to-deployment pipelines with Airflow, covering artifact management, environment provisioning, packaging, deployment, and rollback for SageMaker endpoints.
Own real-time and batch inference on SageMaker: multi-model endpoints (MME), serverless inference where appropriate, blue/green and canary deployment strategies, autoscaling policies, and cost controls (spot strategies, instance sizing).
Implement very low-latency serving patterns using Redis/Valkey: feature caching, online feature retrieval, request-level state, model response caching, and rate limiting/backpressure for bursty traffic.
Provision and manage ML/data infrastructure with Terraform: SageMaker endpoints/configurations, ECR/ECS/EKS resources, network endpoints/VPCs, ElastiCache/Valkey clusters, observability stacks, secrets, and IAM.
Build platform abstractions and golden paths: Airflow DAG templates, CLI/SDK, cookie-cutter repositories, and CI/CD pipelines that move models from notebooks to production predictably.
Establish and manage model lifecycle governance: model/feature registries, approval workflows, promotion policies, lineage and audit trails integrated with Airflow runs and Terraform state.
Implement end-to-end observability: data/feature freshness checks, drift/quality controls, model performance/latency SLOs, infrastructure health dashboards, tracing and alerts, plus incident response and postmortems.
Collaborate with security, SRE, and data engineering teams on private networks, policy-as-code, handling of PII, least-privilege IAM, and cost-effective architectures across environments.
Evaluate, integrate, and rationalize platform tooling (e.g., MLflow registry, feature stores, service gateways); lead migrations with clear change management and minimal downtime.
Requirements
5+ years of experience building and operating production-grade ML/data platforms focused on serving, reliability, and developer experience.
Strong software engineering skills in Python, Go, or Java; experience building resilient services, APIs, and automation tools with high test coverage.
Deep experience with AWS SageMaker inference: endpoint configuration, containerization, model packaging, autoscaling, trade-offs between serverless and real-time, MME, A/B and canary releases.
Expertise with online feature stores such as Redis/Valkey in ML serving contexts.
Proven Terraform experience for end-to-end ML and data infrastructure management: modules, workspaces, drift detection, change review, and safe rollbacks; familiarity with GitOps patterns.
Large-scale Airflow orchestration: dependency modeling, sensors, retries, SLAs, backfills, DAG factories, and integrations with registries, artifact stores, and Terraform pipelines.
Familiarity with ML frameworks (scikit-learn, XGBoost, PyTorch, TensorFlow) from a platform integration perspective to support diverse runtimes and containers.
Observability for ML workflows: metrics/logs/traces, performance profiling, capacity planning, cost monitoring, and runbooks.
Excellent cross-functional communication and collaboration with data science, data engineering, DevOps, and backend teams.
Staff ML Infrastructure Engineer building and scaling robust Compute platforms for Simulation and data workflows at GM. Collaborating with engineers to drive efficiency and reliability in AI infrastructure.
IT Infrastructure Engineer managing network and digital infrastructure for Physicians Insurance, a boutique mutual insurance company. Collaborating on design, deployment, and maintenance operations.
Modern Workplace Exchange Infrastructure Architect at Avanade driving end-to-end cloud solutions with Microsoft 365. Collaborating with a large team on enterprise projects for digital transformation.
Infrastructure Specialist supporting enterprise voice platforms including Avaya and RingCentral. Balancing transformation with service stability while working in a hybrid environment.
VP of Technology Infrastructure leading multidisciplinary teams at Early Warning. Managing complex infrastructure and influencing company strategy for payment solutions.
Senior Infrastructure Architect II at Pacific Life defining global infrastructure architecture and ensuring alignment with business objectives. Collaborating cross-functionally to support enterprise-wide initiatives.
Responsible for managing IT infrastructure, ensuring service availability and security. Leading support teams and overseeing technical projects for Pierre Fabre in Brazil.
Lead Infrastructure Engineer designing secure automation infrastructure for GE Vernova's digital transformation in utility operations. Collaborate with architects to develop reusable IT solutions.
Infrastructure Engineer managing VMware Server Infrastructure for CMA CGM in the UK. Providing L2/L3 support and ensuring smooth IT operations across client environments.
Infrastructure Engineer responsible for IT infrastructure maintenance and user support. Join One Beyond's innovative team to enhance system reliability and performance while working flexibly.