Site Reliability Engineer responsible for system reliability and performance at a support organization. Collaborating with development and operations to enhance system architecture and incident management.
Responsibilities
Partner with development, infrastructure, and operations teams to design highly available, fault-tolerant, and disaster recovery–ready systems.
Implement Infrastructure-as-Code (e.g., Terraform) to automate provisioning, scaling, and management of cloud services (AWS, Azure).
Lead and support incident triage, resolution, and recovery efforts during critical events.
Provide advanced troubleshooting expertise and guide teams during outages.
Conduct detailed postmortems, document lessons learned, and drive improvements to reduce Mean Time to Recovery (MTTR).
Collaborate with developers, QA, and product teams to embed reliability principles throughout the software development lifecycle.
Mentor peers on observability tools, performance optimization, and SRE best practices.
Identify opportunities for continuous improvement in reliability, performance, and cost efficiency.
Evaluate and recommend emerging technologies to enhance scalability and resilience.
Contribute to internal documentation, ensuring best practices are accessible across the organization.
Requirements
4+ years of experience in DevOps, Site Reliability Engineering, or a related role.
Proven track record as a technical lead or subject matter expert (no direct people management required).
Hands-on expertise with cloud platforms (AWS, Azure) and Infrastructure-as-Code (Terraform preferred).
Strong understanding of systems architecture, high availability, fault tolerance, and disaster recovery.
Experience leading incident response and conducting root cause analysis.
Familiarity with observability tools and performance optimization practices.
Strong collaboration and communication skills with the ability to mentor peers and influence best practices.
DevOps Engineer responsible for managing Microsoft Intune operations at Bundesdruckerei GmbH. Focused on ensuring secure digital solutions for identity and data protection in Berlin.
Senior Site Reliability Engineer driving observability and reliability for business-critical systems at Incedo. Collaborating with engineering teams to enhance system resilience and performance.
DevSecOps Specialist securing the software development lifecycle at Vanguard. Collaborating with teams to improve application security tooling and processes, and provide development guidance.
Site Reliability Engineer automating infrastructure deployment for Scaleway's sovereign cloud products. Collaborating with product teams to enhance observability and reliability of the platform.
Reliability Engineer responsible for equipment reliability and safety using data-driven analysis for Wood in Aberdeen. Focus on proactive maintenance and operational efficiency.
Principal Safety and Reliability Engineer developing and supporting safety design for mission-critical aerospace systems. Engaging in design reviews and ensuring compliance with requirements.
Cloud DevOps Engineer playing a pivotal role in developing migration plans for Coast Guard Cloud Architecture. Collaborating with teams to ensure effectiveness and best practices in cloud implementation.
Reliability Engineer III at Daimler Truck developing propulsion technology solutions for electrified and conventional axle components. Leading testing and validation for complex powertrain systems.
Electrical Reliability Engineer at Marathon Petroleum maintaining electrical equipment and systems. Collaborating with cross-functional teams and ensuring compliance with electrical codes and standards.