Principal Site Reliability Engineer enhancing Walmart's customer service platforms for operational excellence. Leading automation and reliability strategies in a large-scale tech environment.
Responsibilities
Drive the design and evolution of monitoring and observability frameworks that enable proactive detection, root cause analysis, and rapid resolution of customer-impacting incidents.
Lead the development and integration of automation tools to streamline operational workflows, reduce toil, and enhance the reliability of customer service platforms.
Participate in on-call rotations, applying deep technical expertise to swiftly diagnose and mitigate production issues, ensuring high availability and minimal disruption to customer support experiences.
Collaborate closely with engineering teams to embed reliability into the software development lifecycle, championing a culture of shared ownership and “you build it, you run it.”
Define and manage SLIs, SLOs, and SLAs to align service reliability with business expectations and continuously improve system performance.
Apply proven reliability patterns and practices, leveraging hands-on experience to architect resilient systems that scale with customer demand.
Lead post-incident reviews and blameless retrospectives, identifying systemic improvements and fostering a culture of continuous learning and operational excellence.
Analyze system performance and advocate for cost-effective optimizations, balancing infrastructure efficiency with world-class service reliability.
Requirements
10+ years of experience engineering and scaling highly available, customer-facing systems with a focus on reliability and operational excellence.
A proven ability to lead the design and implementation of resilient infrastructure and automation solutions that solve complex reliability challenges.
Strong judgment in making architectural trade-offs, balancing long-term system health with short-term delivery needs.
Deep expertise in distributed systems, service ownership models, CI/CD pipelines, and observability practices.
Exceptional communication and collaboration skills, with a track record of influencing cross-functional teams and driving consensus on reliability strategies.
Experience mentoring engineers in incident response, reliability patterns, and career growth within SRE disciplines.
A curious mindset and eagerness to explore new technologies and domains that enhance customer support platforms at scale.
Benefits
Health benefits include medical, vision and dental coverage.
Financial benefits include 401(k), stock purchase and company-paid life insurance.
Paid time off benefits include PTO (including sick leave), parental leave, family care leave, bereavement, jury duty, and voting.
Other benefits include short-term and long-term disability, company discounts, Military Leave Pay, adoption and surrogacy expense reimbursement, and more.
Walmart-paid education benefit program for full-time and part-time associates, covering tuition, books, and fees.
DevOps Engineer helping deploy MVP, CRM, and billing systems for Newrich Network. Focused on infrastructure, automation, and building for scale with potential to go full-time.
Cloud Operations Engineer supporting and maintaining multi-cloud public infrastructure for enterprise customers. Working in a structured ITIL environment and contributing to operational excellence.
DevOps Engineer building and maintaining authentication platforms in multi-cloud environments. Using technologies like Terraform, Ansible, and Python for automation and optimization.
Cloud Engineer developing Infrastructure-as-Code with Terraform and Azure DevOps. Managing Azure infrastructure and leading incident response within cross-functional teams.
DevSecOps Engineer at Skillfield working on secure CI/CD pipelines for mobile-first delivery. Collaborating with teams to embed security and automation in the delivery lifecycle.
Lead DevOps Engineer focused on AWS and Azure data platform solutions. Collaborating with teams to deliver scalable, secure, and highly available solutions.
DevOps Engineer working at GRÜN Software Group to automate and maintain stable infrastructures. Collaborating with teams to improve deployments and processes for better performance.
Linux System Administrator managing IT infrastructures for educational institutions and research. Collaborating on DevOps and HPC projects while ensuring system security and performance.
Azure SRE Engineer responsible for designing and maintaining secure, scalable Azure cloud infrastructure. Driving automation and operational excellence for leading organizations in technology transformation.