Senior ML Platform Engineer at GEICO designing scalable infrastructure for machine learning. Focusing on Large Language Models while leading infrastructure and platform engineering initiatives.
Responsibilities
ML Platform & Infrastructure
Design and implement scalable infrastructure for training, fine-tuning, and serving open-source LLMs (Llama, Mistral, Gemma, etc.)
Architect and manage Kubernetes clusters for ML workloads, including GPU scheduling, autoscaling, and resource optimization
Design, implement, and maintain feature stores for ML model training and inference pipelines
Build and optimize LLM inference systems using frameworks like vLLM, TensorRT-LLM, and custom serving solutions
Ensure 99.9%+ uptime for ML platforms through robust monitoring, alerting, and incident response procedures
Design and implement ML platforms using DataRobot, Azure Machine Learning, Azure Kubernetes Service (AKS), and Azure Container Instances
Develop and maintain infrastructure using Terraform, ARM templates, and Azure DevOps
Implement cost-effective solutions for GPU compute, storage, and networking across Azure regions
Ensure ML platforms meet enterprise security standards and regulatory compliance requirements
Evaluate and, where appropriate, implement hybrid cloud solutions with AWS/GCP for backup or specialized use cases
Design and maintain robust CI/CD pipelines for ML model deployment using Azure DevOps, GitHub Actions, and MLOps tools
Implement automated model training, validation, deployment, and monitoring workflows
Set up comprehensive observability using Prometheus, Grafana, Azure Monitor, and custom dashboards
Continuously optimize platform performance, reducing latency and improving throughput for ML workloads
Design and implement backup, recovery, and business continuity plans for ML platforms
Mentor junior engineers and data scientists on platform best practices, infrastructure design, and ML operations
Lead comprehensive code reviews focusing on scalability, reliability, security, and maintainability
Design and deliver technical onboarding programs for new team members joining the ML platform team
Establish and champion engineering standards for ML infrastructure, deployment practices, and operational procedures
Collaborate with product engineering teams to integrate ML capabilities into customer-facing applications
Requirements
Bachelor’s degree in Computer Science, Engineering, or a related technical field (or equivalent experience)
7+ years of software engineering experience with a focus on infrastructure, platform engineering, or MLOps
3+ years of hands-on experience with machine learning infrastructure and deployment at scale
2+ years of experience working with Large Language Models and transformer architectures
Proficient in Python; strong skills in Go, Rust, or Java preferred
Proven experience working with open-source LLMs (Llama 2/3, Qwen, Mistral, Gemma, Code Llama, etc.)
Proficient in Kubernetes, including custom operators, Helm charts, and GPU scheduling
Deep expertise in Azure services (AKS, Azure ML, Container Registry, Storage, Networking)
Experience implementing and operating feature stores (Chronon, Feast, Tecton, Azure ML Feature Store, or custom solutions)
Hands-on experience with inference optimization using vLLM, TensorRT-LLM, Triton Inference Server, or similar tools
Benefits
Comprehensive Total Rewards program that offers personalized coverage tailor-made for you and your family's overall well-being.
Financial benefits including market-competitive compensation; a 401(k) savings plan vested from day one that offers a 6% match; performance and recognition-based incentives; and tuition assistance.
Access to additional benefits like mental healthcare as well as fertility and adoption assistance.
Flexibility: We provide workplace flexibility as well as our GEICO Flex program, which offers the ability to work from anywhere in the US for up to four weeks per year.
Software Engineer at Pico Technology translating strategic objectives into robust, secure software solutions. Leading software architecture and coding efforts while collaborating with cross-functional teams.
AI Technical Lead responsible for designing and implementing AI and BI solutions at Genia. Supporting clients in their digital transformation while leveraging cloud services and data engineering practices.
Join Snap Inc. as a Level 3 Software Engineer to work on various challenging technical projects. Develop code that impacts Snap’s products and technology, and collaborate with dynamic teams.
Software developer enhancing and maintaining production test environments using Python and Qt at BDT, a leader in smart technology solutions. Collaborating on product introduction and process optimization with international partners.
Platform Enabling Software Engineer developing graphics drivers across integrated and discrete graphics for Intel. Adapting driver functionality for HW changes and collaborating with upstream communities.
As a Staff Software Development Engineer at CVS Health, lead transformative integration programs. Focus on enhancing customer service solutions and architectural frameworks.
Software Architect responsible for developing ERP solutions on Microsoft Business Central and ensuring system architecture stability. Collaborating closely with product management and working within a Scrum team to shape the future of the ERP.
Senior Software Engineer collaborating with Computational Structural Engineers to develop automation tools for Engineering Design using various Python libraries.
Software Engineer building a next-generation CMS and web platforms at Mistral AI. Collaborating with marketing and engineering teams to enhance digital content management.
Software Engineer II in Workday Integration at Travelers, leading design and development for system assignments. Engage with stakeholders to deliver technical solutions efficiently and effectively.