The Site Reliability Engineer (SRE) will lead the implementation and management of observability, monitoring, and reliability practices across our hybrid infrastructure.
This role requires hands-on expertise with Datadog or similar observability platforms, strong Azure administration skills, and a deep understanding of incident response and system performance.
The SRE will work closely with Infrastructure, Support, and Application teams to ensure high availability and operational excellence across on-prem and cloud environments.
Responsibilities
Designs, implements, and manages observability solutions using Datadog or equivalent platforms.
Develops and maintains monitoring dashboards, alerts, and telemetry pipelines for critical systems.
Leads incident response efforts, including root cause analysis and postmortem documentation.
Collaborates with Infrastructure and Application teams to improve system reliability and performance.
Supports Azure administration tasks including resource monitoring, performance tuning, and cost optimization.
Defines and enforces best practices for system health, uptime, and scalability.
Contributes to automation of operational tasks and reliability improvements.
Documents observability standards, incident workflows, and operational runbooks.
Requirements
Bachelor’s degree in Computer Science, Information Technology, or equivalent.
Minimum of five (5) years of experience in Site Reliability Engineering, Infrastructure Monitoring, or DevOps.
Proficiency with Datadog or similar observability platforms (e.g., Prometheus, New Relic, Splunk).
Strong Azure administration experience including monitoring, resource management, and automation.
Solid understanding of on-prem infrastructure and hybrid cloud environments.
Experience with incident response, root cause analysis (RCA), and operational documentation.
Strong scripting skills (e.g., PowerShell, Python) for automation and integration.
Excellent communication and collaboration skills across technical teams.
Benefits
Hybrid, remote, and flexible on-site work schedules are available, depending on the position.
Excellent benefits package, including but not limited to medical, dental, and vision coverage, plus health savings and flexible spending accounts
401(k) with employer matching
Employer-paid life insurance and short- and long-term disability coverage
Employee Assistance Program
Generous paid time off is also available to all full-time employees