Distinguished AI/ML Engineer leading technical development of agentic AI systems for Walmart Global Tech. Ensuring system reliability and operational excellence with advanced AI solutions.
Responsibilities
As a Distinguished AI/ML Engineer within Walmart Global Tech’s Reliability Engineering Organization, you will lead the technical development of next-generation agentic AI systems and intelligent automation solutions that ensure mission-critical reliability, scalability, and operational excellence across Walmart’s entire technology ecosystem.
Architect and implement cutting-edge machine learning platforms and autonomous agents that transform how we manage change and performance and how we monitor, predict, and automatically resolve issues.
Design and implement multi-agent orchestration platforms that coordinate autonomous agents for change management, capacity planning, and performance optimization across e-commerce, supply chain, and in-store systems.
Develop self-healing infrastructure platforms that leverage AI to predict, prevent, and automatically remediate system issues.
Collaborate with engineering teams and leadership to reduce mean time to detect (MTTD) and mean time to restore (MTTR) through intelligent automation and predictive capabilities.
Requirements
Bachelor’s or Master’s degree in Engineering, Computer Science, or a related field, with 12+ years of hands-on experience in Reliability Engineering, AI/ML Engineering, or Platform Engineering.
Proven record as a senior individual contributor influencing architecture and driving technical excellence across large organizations.
Deep experience operating mission-critical systems, with expertise in MTTD, MTTR, availability, change management, model performance, and autonomous system reliability.
Expert-level AI/ML engineering experience, including deep learning frameworks such as TensorFlow and PyTorch and large-scale production ML deployments.
Advanced experience with agentic AI systems, including multi-agent frameworks, autonomous decision-making systems, LLM-based agents, and agent orchestration platforms.
Comprehensive Reliability Engineering expertise, including service management (Incident, Problem, and Change Management) and performance and capacity engineering for AI/ML systems.
Expert-level cloud engineering experience (Azure, GCP, AWS) with containerization (Kubernetes, Docker), serverless architectures, and cloud-native AI services.
Deep observability experience across distributed tracing, metrics, logs, APM, and AI-driven anomaly detection.
Strong platform engineering background including infrastructure as code, service mesh architectures, API gateways, and self-service developer platforms.
Benefits
Health benefits include medical, vision, and dental coverage.
Financial benefits include 401(k), stock purchase, and company-paid life insurance.
Paid time off benefits include PTO (including sick leave), parental leave, family care leave, bereavement, jury duty, and voting.
Other benefits include short-term and long-term disability, company discounts, Military Leave Pay, adoption and surrogacy expense reimbursement, and more.
You will also receive PTO and/or PPTO that can be used for vacation, sick leave, holidays, or other purposes.
Live Better U is a Walmart-paid education benefit program for full-time and part-time associates in Walmart and Sam's Club facilities. Programs range from high school completion to bachelor's degrees, including English Language Learning and short-form certificates. Tuition, books, and fees are completely paid for by Walmart.
Machine Learning Engineer Intern contributing to AI solutions for financial services. Engaging in hands-on ML projects and real production issues in a hybrid working environment.
Machine Learning Engineer in the CTO division at Open Cosmos developing ML-driven solutions for spacecraft operations. Focused on anomaly detection, forecasting, and decision-making automation.
Develop and automate Machine Learning models for Telecommunications networks during Master’s Thesis Internship. Engage with real-world operational data from Nokia's Microwave Radio technology.
Machine Learning Scientist III developing AI solutions for multi-product domain at Expedia Group. Collaborating with product managers and engineers to optimize travel experiences through machine learning.
Applied AI Engineer at Mistral AI integrating AI products for clients, managing complex technical challenges while working in a collaborative environment.
Staff Machine Learning Engineer at Adobe, leading technical efforts for scalable GenAI services across products like Photoshop and Lightroom. Collaborating closely with research and product teams for high-performance solutions.
AI Engineer building agentic systems and applying AI models for national security initiatives. Collaborating with teams to solve client challenges with cutting-edge AI solutions.
Lead AI/ML Engineer on army enterprise team training and deploying models on cutting-edge AI technologies. Collaborate across teams to solve real-world challenges in the threat landscape.
Machine Learning Scientist III at Expedia Group developing ML algorithms to enhance customer experience. Tackling complex problems in online travel for improved post-booking recommendations and service.
Senior Machine Learning Engineer architecting next-gen Agentic AI systems for enterprise workflows at Demandbase. Focused on multi-agent orchestration and LLM-powered reasoning systems.