Site Reliability Engineer leading observability and monitoring practices for hybrid infrastructure at PANTHERx. Collaborating with various teams to enhance system performance and reliability.
Responsibilities
The Site Reliability Engineer (SRE) will lead the implementation and management of observability, monitoring, and reliability practices across our hybrid infrastructure.
This role requires hands-on expertise with Datadog or similar observability platforms, strong Azure administration skills, and a deep understanding of incident response and system performance.
The SRE will work closely with Infrastructure, Support, and Application teams to ensure high availability and operational excellence across on-prem and cloud environments.
Designs, implements, and manages observability solutions using Datadog or equivalent platforms.
Develops and maintains monitoring dashboards, alerts, and telemetry pipelines for critical systems.
Leads incident response efforts, including root cause analysis and postmortem documentation.
Collaborates with Infrastructure and Application teams to improve system reliability and performance.
Supports Azure administration tasks including resource monitoring, performance tuning, and cost optimization.
Defines and enforces best practices for system health, uptime, and scalability.
Contributes to automation of operational tasks and reliability improvements.
Documents observability standards, incident workflows, and operational runbooks.
Requirements
Bachelor’s degree in Computer Science, Information Technology, or equivalent.
Minimum of five (5) years of experience in Site Reliability Engineering, Infrastructure Monitoring, or DevOps.
Proficiency with Datadog or similar observability platforms (e.g., Prometheus, New Relic, Splunk).
Strong Azure administration experience including monitoring, resource management, and automation.
Solid understanding of on-prem infrastructure and hybrid cloud environments.
Experience with incident response, RCA, and operational documentation.
Strong scripting skills (e.g., PowerShell, Python) for automation and integration.
Excellent communication and collaboration skills across technical teams.
Benefits
Hybrid, remote and flexible on-site work schedules are available, based on the position.
Excellent benefits package, including but not limited to medical, dental, and vision coverage, health savings accounts, and flexible spending accounts
401(k) with employer matching
Employer-paid life insurance and short- and long-term disability coverage
Employee Assistance Program
Generous paid time off for all full-time employees