Site Reliability Engineer responsible for system reliability and performance at a support organization. Collaborating with development and operations to enhance system architecture and incident management.
Responsibilities
Partner with development, infrastructure, and operations teams to design highly available, fault-tolerant, and disaster recovery–ready systems.
Implement Infrastructure-as-Code (e.g., Terraform) to automate provisioning, scaling, and management of cloud services (AWS, Azure).
Lead and support incident triage, resolution, and recovery efforts during critical events.
Provide advanced troubleshooting expertise and guide teams during outages.
Conduct detailed postmortems, document lessons learned, and drive improvements to reduce Mean Time to Recovery (MTTR).
Collaborate with developers, QA, and product teams to embed reliability principles throughout the software development lifecycle.
Mentor peers on observability tools, performance optimization, and SRE best practices.
Identify opportunities for continuous improvement in reliability, performance, and cost efficiency.
Evaluate and recommend emerging technologies to enhance scalability and resilience.
Contribute to internal documentation, ensuring best practices are accessible across the organization.
Requirements
4+ years of experience in DevOps, Site Reliability Engineering, or a related role.
Proven track record as a technical lead or subject matter expert (no direct people management required).
Hands-on expertise with cloud platforms (AWS, Azure) and Infrastructure-as-Code (Terraform preferred).
Strong understanding of systems architecture, high availability, fault tolerance, and disaster recovery.
Experience leading incident response and conducting root cause analysis.
Familiarity with observability tools and performance optimization practices.
Strong collaboration and communication skills with the ability to mentor peers and influence best practices.
DevOps Engineer focusing on deploying high-security on-prem infrastructure and MLOps platforms for mission-critical systems. Collaborating on Kubernetes-based orchestration and machine learning workloads.
Cloud Site Reliability Engineer managing Solace Cloud services across leading cloud providers. Ensuring reliability, handling incidents, and collaborating with customers for operational excellence.
Senior Cloud Site Reliability Engineer ensuring reliability and health of Solace Cloud Services with hands-on cloud operations expertise. Lead incident management and customer support for high-impact environments.
DevOps Engineer designing and operating AWS infrastructure within industrial IoT environments. Working on systems that ensure security, resilience, and end-to-end observability.
Sr. Site Reliability Engineer (SRE) III providing technical solutions for the federal government. Collaborating in a high-performing team focused on reliability and application scalability.
Senior Linux System Engineer developing and maintaining Linux server infrastructure for Th. Geyer GmbH. Collaborating on ERP systems and CI/CD processes while ensuring system performance and security.
Platform Engineer leading the development of cloud application platforms for Allstate. Responsible for cloud infrastructure for ML experimentation and production deployments.
Cloud Platform Engineer (ML DevOps) developing and managing CI/CD pipelines for ML workflows in a leading insurance company. Collaborating with data scientists and ensuring infrastructure security and compliance.