Senior ML Platform Engineer at Mistplay researching and developing machine learning solutions. Collaborating with teams to solve complex business problems and enhance the mobile gaming experience.
Responsibilities
Design, build, and operate standardized training-to-deployment pipelines with Airflow, covering artifact management, environment provisioning, packaging, deployment, and rollback for SageMaker endpoints.
Own real-time and batch inference on SageMaker: multi-model endpoints (MME), serverless inference where appropriate, blue/green and canary deployment strategies, autoscaling policies, and cost controls (spot strategies, instance sizing).
Implement very low-latency serving patterns using Redis/Valkey: feature caching, online feature retrieval, request-level state, model response caching, and rate limiting/backpressure for bursty traffic.
Provision and manage ML/data infrastructure with Terraform: SageMaker endpoints/configurations, ECR/ECS/EKS resources, network endpoints/VPCs, ElastiCache/Valkey clusters, observability stacks, secrets, and IAM.
Build platform abstractions and golden paths: Airflow DAG templates, CLI/SDK, cookie-cutter repositories, and CI/CD pipelines that move models from notebooks to production predictably.
Establish and manage model lifecycle governance: model/feature registries, approval workflows, promotion policies, lineage and audit trails integrated with Airflow runs and Terraform state.
Implement end-to-end observability: data/feature freshness checks, drift/quality controls, model performance/latency SLOs, infrastructure health dashboards, tracing and alerts, plus incident response and postmortems.
Collaborate with security, SRE, and data engineering teams on private networks, policy-as-code, handling of PII, least-privilege IAM, and cost-effective architectures across environments.
Evaluate, integrate, and rationalize platform tooling (e.g., MLflow registry, feature stores, service gateways); lead migrations with clear change management and minimal downtime.
Requirements
5+ years of experience building and operating production-grade ML/data platforms focused on serving, reliability, and developer experience.
Strong software engineering skills in Python, Go, or Java; experience building resilient services, APIs, and automation tools with high test coverage.
Deep experience with AWS SageMaker inference: endpoint configuration, containerization, model packaging, autoscaling, trade-offs between serverless and real-time, MME, A/B and canary releases.
Expertise with online feature stores such as Redis/Valkey in ML serving contexts.
Proven Terraform experience for end-to-end ML and data infrastructure management: modules, workspaces, drift detection, change review, and safe rollbacks; familiarity with GitOps patterns.
Large-scale Airflow orchestration: dependency modeling, sensors, retries, SLAs, backfills, DAG factories, and integrations with registries, artifact stores, and Terraform pipelines.
Familiarity with ML frameworks (scikit-learn, XGBoost, PyTorch, TensorFlow) from a platform integration perspective to support diverse runtimes and containers.
Observability for ML workflows: metrics/logs/traces, performance profiling, capacity planning, cost monitoring, and runbooks.
Excellent cross-functional communication and collaboration with data science, data engineering, DevOps, and backend teams.
Infrastructure Engineer maintaining high availability of systems at mortgage platform provider Pylon. Focus on developer productivity and codebase quality with instant feedback from peers.
Infrastructure Systems Engineer II managing production application support for Conduent. Collaborating on ITIL processes and incident management while working in a 24/7 environment.
OT Cybersecurity Specialist responsible for secure IT-OT infrastructures in industrial operations. Engaging in secure deployments, integrating cybersecurity frameworks, and providing expert support.
Infrastructure and Security Engineer collaborating on the design of secure architectures at CRG Solutions. Integrating cybersecurity best practices and managing incidents in Windows and Linux environments.
Senior Infrastructure Engineer managing global IT infrastructure for aviation solutions, focusing on VMware, Nutanix, and Windows Server environments. Collaborating with teams to ensure high availability and optimal performance in a hybrid work model.
Cloud Support Engineer maintaining operational stability and automation for Azure cloud platforms. Working collaboratively across IT teams to ensure infrastructure reliability and security.
Database Engineer at Aircall building tooling for database management and observability. Working in a fast-paced environment for an innovative customer communications platform.
Lead Cloud Infrastructure Engineer at Paramount managing cloud architecture and infrastructure initiatives across environments. Involved in automation, scalability, and mentoring infrastructure engineers.
Senior Infrastructure Engineer specializing in Cisco and VMware to modernize hybrid environments for strategic partners. Ownership and mentorship role within a collaborative IT team.
Data Cloud & Infrastructure Architect connecting BigQuery potential with Salesforce execution. Mastering identity resolution and driving real-time data orchestration in a hybrid environment.