Staff Machine-Learning Infrastructure Engineer developing ML infrastructure for Voxel, enhancing workplace safety through AI and computer vision technology.
Responsibilities
Own data & labeling pipelines – architect scalable labeling services (storage, query, retrieval), design ontologies, automate annotation workflows, and build quality-tiered datasets that stay within cost constraints.
Build and operate training infrastructure – create multi-GPU / multi-node training frameworks (Ray, Spark, Kubernetes), optimize distributed jobs, and integrate acceleration techniques (TensorRT, CUDA Graphs, FP8, etc.).
Manage the full model lifecycle – stand up model registries, version control, evaluation suites, and continuous-learning loops that push updates from dev → staging → prod with zero-downtime rollbacks.
Provide technical leadership, mentorship, and lightweight project management to a small infra + research squad.
Establish DevOps-for-ML best practices (IaC, CI/CD, observability, cost monitoring) so researchers can iterate quickly and safely.
Partner with ML engineers on architecture decisions, from data schemas to inference optimizations, ensuring infra and research roadmaps stay tightly aligned.
Requirements
Bachelor’s degree (or higher) in Computer Science, EE, or a related field.
5+ years building and operating large-scale infrastructure, with at least 3 years focused on ML or data-intensive systems.
Proven record of designing highly available, distributed systems on Kubernetes (EKS, GKE, or on-prem).
Deep expertise with orchestration (K8s operators, Argo, Kubeflow) and cluster-scale storage / compute (S3, GCS, Ray, Spark, Dask).
Hands-on experience automating data-labeling or ground-truth workflows and maintaining dataset versioning.
Strong software-engineering fundamentals; familiar with best practices for testing, observability, and secure coding.
AI Engineer at Trunk Tools revolutionizing construction with intelligent automation and production-ready AI agents. Leading design and implementation of multi-agent systems for document and data processing.
Audio Machine Learning Co-op developing real-time AI-powered audio processing algorithms for Bose. Collaborating with experts to prototype and implement novel ML algorithms for various applications.
AI Center of Excellence Engineer at F5 supporting applied AI research, prototyping, and engineering initiatives. Evaluating AI techniques and creating integration recommendations for production systems.
Senior ML Engineer at Centra developing forecasting and AI-driven decision support for fashion brands. Collaborating to enhance ecommerce through machine learning and insights.
Staff ML/AI Engineer for healthcare communication solutions at Accurx. Leading AI/ML initiatives to enhance patient communication and healthcare efficiency.
Senior Machine Learning Engineer developing ML systems for healthcare communication technology at Accurx. Join our mission-driven team to solve real-world problems in healthcare.
Senior Developer at Valorem Reply delivering ML/AI applications on AWS. Collaborating with product and engineering teams to provide high-quality tech solutions.
Senior Developer building and evolving ML/AI applications on AWS for Valorem Reply. Collaborating closely with product, architecture, and engineering teams for quality solutions.
Senior Software Engineer designing and operating ML infrastructure for Plaid's AI initiatives. Collaborating with product teams to accelerate AI-powered financial experiences and ensure scalable ML systems.
Staff AI Engineer at GEICO designing and deploying AI platforms for virtual agent workflows. Collaborating with teams to improve service for millions of customers.