Director of Site Reliability Engineering at Mastercard, overseeing resilience and operational excellence initiatives. Leading a high-performing team of technical leaders within CX Technology.
Responsibilities
Lead and develop a team of highly skilled people leaders and senior individual contributors within the CX Technology organization, fostering a culture of accountability, innovation, and continuous improvement
Define and drive the short-term and medium-term strategic vision for Site Reliability Engineering, aligning reliability, scalability, and operational efficiency initiatives with broader Mastercard technology and business objectives
Lead the design and execution of cross-functional initiatives that improve system resilience, automate operational processes, and mature incident management, problem management, and reliability engineering practices
Establish, evolve, and govern reliability standards, operational best practices, and control frameworks to ensure consistent adoption across engineering and delivery teams
Partner closely with engineering, product, architecture, and business stakeholders to embed reliability requirements into system design, development, deployment, and lifecycle management processes
Oversee major incident response and escalation efforts, ensuring rapid recovery, effective communication, and high-quality root cause analysis with actionable remediation
Promote proactive risk identification and mitigation through observability, capacity planning, resiliency testing, and automation-driven approaches
Champion continuous improvement by leveraging operational metrics, insights, and retrospectives to drive measurable improvements in availability, stability, and customer experience
Stay informed on industry trends, emerging technologies, and modern SRE practices, applying relevant innovations to advance Mastercard’s operational maturity
Manage goal setting, coaching, performance management, and talent development for people leaders and senior technologists, building a strong leadership pipeline and sustaining operational excellence at scale.
Requirements
Proven experience leading Site Reliability Engineering, Production Engineering, or large-scale operations teams within complex, highly available, distributed technology environments
Strong people leadership background, including managing managers and/or senior technical leaders, with demonstrated success building high-performing, inclusive teams
Deep understanding of reliability engineering principles, including incident management, automation, telemetry, observability, resilience engineering, capacity planning, and service lifecycle management
Demonstrated ability to translate strategy into execution by evolving processes, programs, and policies to drive meaningful and measurable operational improvements
Experience partnering across engineering, product, and business functions to influence design decisions and embed reliability throughout the development lifecycle
Strong analytical and problem-solving skills, with a track record of driving root cause analysis and long-term corrective actions
Excellent communication and stakeholder management skills, with the ability to lead through influence at senior and executive levels
Passion for continuous improvement, operational discipline, and leveraging technology to reduce toil and improve system outcomes at scale
Bachelor’s degree in Computer Science, Engineering, or equivalent practical experience; advanced degree preferred.
Security Responsibilities
Abide by Mastercard’s security policies and practices
Ensure the confidentiality and integrity of the information being accessed
Report any suspected information security violation or breach
Complete all periodic mandatory security trainings