Director of Reliability Engineering at F5, leading teams in IT Service Management and infrastructure. Focused on cloud and observability practices, driving technical excellence and innovation.
Responsibilities
Support infrastructure operation teams focused on cloud and on-premise infrastructure.
Support sustainable engineering practices, including systematic intake, driving infrastructure management practices, and modern configuration management.
Support cloud platform practices for a multi-cloud ecosystem including AWS, GCP, and Azure.
Establish cloud platform practices and help build technologies and practices that lower the barrier of entry for engineers using cloud infrastructure.
Support engineering enablement services and practices.
Help build a strategic roadmap for tooling and automation that will allow engineers to quickly, securely, and effectively build applications in cloud ecosystems.
Own critical ITIL practices including change management, problem management, and incident management.
Drive adoption of documented processes, using strong cross-organizational relationships to ensure success and support maturity assessments within Digital.
Build connectivity between processes and practices, helping drive robust metrics and simplified strategies for turning documentation into engineering practices.
Own incident management responses and ensure communications and escalations to senior leadership are simple and effective.
Drive change management activities, including CAB, release management, and audit execution.
Build and blend modern observability practices with traditional NOC/SOC teams to create a lean and robust monitoring ecosystem across SaaS, cloud, and on-premise services.
Integrate incident management practices with automated observability tools and methodologies to drive visibility into system health and ensure service owners know about issues before their users do.
Establish clear metrics such as Key Performance Indicators (KPIs), Service Level Objectives (SLOs), and Service Level Indicators (SLIs) to measure and continuously improve operational performance.
Foster a proactive culture of monitoring and early detection to identify and address system anomalies before they impact users or infrastructure reliability.
Build and lead high-performing global teams across infrastructure, ITSM, and observability.
Create strategic roadmaps and participate as delivery lead in large program-level initiatives.
Collaborate closely with Security, Engineering, Compliance, and Legal organizations to ensure alignment and transparency.
Mentor, develop, and support technical teams, driving a culture of ownership, innovation, and continuous improvement.
Define KPIs and metrics to measure operational performance and developer productivity.
Drive vendor strategy and manage partner relationships for infrastructure platforms and developer tools.
Requirements
12+ years of experience in infrastructure, platform engineering, ITIL, or developer tooling, with 5+ years in senior leadership roles.
Proven track record overseeing large-scale cloud environments and physical data centers in complex enterprise settings.
Expertise in Agile methodologies and driving team cultures through iterative improvement and technical excellence.
Expertise in infrastructure-as-code (Terraform, Ansible), cloud-native operations, and hybrid networking.
Deep understanding of developer platforms, including source control (GitHub, GitLab, Perforce, ADO), artifact repositories, CI/CD frameworks, and observability stacks.
Strong grasp of DevOps principles, platform engineering, and infrastructure automation practices.
Experience with NOC/SOC operations or observability practices and driving operational resilience through system health metrics.
Experience with compliance, risk management, and operational excellence frameworks (e.g., ITIL, SOC 2, ISO).
Strategic thinker with excellent leadership, communication, and stakeholder management skills.
Bachelor's or Master's degree in Computer Science, Engineering, or a related field.
Senior Reliability Engineer at Sonova ensuring dependable performance of hearing solutions for millions of users globally. Applying engineering skills to improve product reliability across development stages.
Equipment and Reliability Engineer at Chobani responsible for improving asset efficiency and redesigning equipment. Collaborating with Operations to solve complex problems and leading projects in a team environment.
Reliability Engineer II focused on enhancing safety, efficiencies, and cost controls at Freeport-McMoRan mining operations. Collaborating with multiple teams and managing engineering projects.
Reliability Engineer I responsible for equipment failure analysis and improvement recommendations at Freeport-McMoRan's copper smelting operations. Ensuring uninterrupted production and managing equipment health through data analysis.
Designing, building, and maintaining the Kubernetes-based developer platform for Schwarz IT Barcelona. Collaborating with engineering teams to enhance services in Azure and Google Cloud.
Database Reliability Engineer managing MySQL database infrastructure at PointClickCare. Collaborating with Engineering and SRE teams for product development and reliable integration across the platform.
Team lead for building cleaning services in Grimma, responsible for planning, organizing, and leading the cleaning team. Hands-on participation and adherence to hygiene and quality standards are required.
Service Reliability Engineer providing technical support and managing incidents for BT International. Ensuring system availability and collaborating with global stakeholders to achieve objectives.
Studying for a Bachelor of Arts in Accounting, Taxation, and Economic Law while gaining practical experience in a dynamic team. Benefit from a varied working day and continuous development opportunities.
Technical Trainer conducting workshops and training sessions on MERKUR Group's product content for diverse audiences. Engaging with employees and clients to ensure smooth product operation and understanding.