Site Reliability Engineer managing incident response and observability solutions in a hybrid work environment at F5. Collaborating across teams to enhance system reliability and communication during incidents.
Responsibilities
Lead the resolution of major incidents by managing the end-to-end incident lifecycle, including detection, escalation, troubleshooting, and resolution
Serve as the incident facilitator during escalations, ensuring effective, clear, and timely communication between all stakeholders to drive collaborative problem-solving
Ensure appropriate handoffs and escalations between global engineering and incident management teams
Coordinate root cause analysis (RCA) efforts, facilitating discussions to identify contributing factors, lessons learned, and long-term corrective actions to reduce the likelihood of recurrence
Create, document, and improve incident response and management processes, defining clear roles and responsibilities for all participants during incidents
Ensure stakeholders and leadership across business and technical teams are kept informed with clear, concise updates during incidents, minimizing customer and business impact
Maintain open lines of communication by making sure engineering teams engage in communication processes during incidents and clearly understand their responsibilities
Design, implement, and manage end-to-end observability solutions, including synthetic monitoring, infrastructure monitoring, tracing and metrics monitoring systems
Evaluate, deploy, and maintain observability and monitoring tools such as DataDog, Grafana, LogicMonitor, Splunk, New Relic or similar platforms
Maintain and manage escalation tooling such as VictorOps or PagerDuty to ensure teams across the organization have up-to-date schedules and escalation processes
Build and maintain monitoring and alerting for critical systems, ensuring that warnings and issues are quickly identified and actionable in real time
Drive the standardization of monitoring practices across teams, ensuring critical applications, systems, and infrastructure components are well-instrumented and monitored
Develop infrastructure monitoring pipelines leveraging telemetry, logging, tracing, metrics, and visualization tools to provide accurate insights into production system health
Support efforts to define and document standard operating procedures for managing incidents, alerts, system failures, and post-incident reviews across global teams
Collaborate with development, infrastructure, and security teams to improve system reliability through efficient processes and workflows
Advocate for the development and implementation of SLAs, SLOs, and error budgets to support decision-making and prioritization in reliability efforts
Identify and implement opportunities to automate manual operational tasks to further reduce incident response and resolution times
Work closely with the service desk to ensure consistent incident management practices and appropriate escalations to the major incident management team
Partner with engineering, operations, and security teams to confirm observability tools and monitoring approaches meet their needs and align with organizational standards
Actively engage during incident scenarios to ensure identification and mobilization of the appropriate resources, facilitating collaboration across teams and ensuring best practices are followed
Contribute to a culture of shared responsibility and blameless postmortems by documenting and communicating findings from incident responses
Proactively provide input to the SRE Manager to recommend improvements in processes, tools, and systems to enhance team capabilities and outcomes
Requirements
Bachelor’s degree in Computer Science, Information Technology, or a related field (or equivalent professional experience)
3+ years of professional experience in Site Reliability Engineering (SRE), System Engineering, DevOps, or IT Operations roles
Highly experienced as a major incident manager, incident commander, or similar role, with a proven ability to facilitate, communicate, and drive resolution of technical incidents
Strong understanding of ITIL principles and their application in incident management
Experience with observability tools such as DataDog, Grafana, LogicMonitor, Splunk, New Relic, or similar technologies
Experience with synthetic monitoring, infrastructure monitoring, and metrics and tracing monitoring tools
Experience with hybrid infrastructure environments and an understanding of monitoring signals from static on-premises infrastructure, cloud-based ephemeral infrastructure, and SaaS applications
Strong understanding of telemetry, logging, tracing, and their roles in system monitoring and observability pipelines
Experience with Python, Go, Bash, or a similar language to develop and maintain monitoring and automation scripts
Proven ability to remain calm and effective during high-pressure situations, facilitating resolution in a methodical, professional manner