Site Reliability Engineer leading observability and monitoring practices for hybrid infrastructure at PANTHERx. Collaborating with Infrastructure, Support, and Application teams to improve system performance and reliability.
Responsibilities
The Site Reliability Engineer (SRE) will lead the implementation and management of observability, monitoring, and reliability practices across our hybrid infrastructure.
This role requires hands-on expertise with Datadog or similar observability platforms, strong Azure administration skills, and a deep understanding of incident response and system performance.
The SRE will work closely with Infrastructure, Support, and Application teams to ensure high availability and operational excellence across on-prem and cloud environments.
Designs, implements, and manages observability solutions using Datadog or equivalent platforms.
Develops and maintains monitoring dashboards, alerts, and telemetry pipelines for critical systems.
Leads incident response efforts, including root cause analysis and postmortem documentation.
Collaborates with Infrastructure and Application teams to improve system reliability and performance.
Supports Azure administration tasks including resource monitoring, performance tuning, and cost optimization.
Defines and enforces best practices for system health, uptime, and scalability.
Contributes to automation of operational tasks and reliability improvements.
Documents observability standards, incident workflows, and operational runbooks.
Requirements
Bachelor’s degree in Computer Science, Information Technology, or equivalent.
Minimum of five (5) years of experience in Site Reliability Engineering, Infrastructure Monitoring, or DevOps.
Proficiency with Datadog or similar observability platforms (e.g., Prometheus, New Relic, Splunk).
Strong Azure administration experience including monitoring, resource management, and automation.
Solid understanding of on-prem infrastructure and hybrid cloud environments.
Experience with incident response, root cause analysis (RCA), and operational documentation.
Strong scripting skills (e.g., PowerShell, Python) for automation and integration.
Excellent communication and collaboration skills across technical teams.
Benefits
Hybrid, remote and flexible on-site work schedules are available, based on the position.
Excellent benefits package, including but not limited to medical, dental, vision, health savings, and flexible spending accounts
401(k) with employer matching
Employer-paid life insurance and short- and long-term disability coverage
Employee Assistance Program
Generous paid time off for all full-time employees