Site Reliability Engineer responsible for system reliability and performance at a support organization. Collaborates with development and operations teams to improve system architecture and incident management.
Responsibilities
Partner with development, infrastructure, and operations teams to design highly available, fault-tolerant, and disaster recovery–ready systems.
Implement Infrastructure-as-Code (e.g., Terraform) to automate provisioning, scaling, and management of cloud services (AWS, Azure).
Lead and support incident triage, resolution, and recovery efforts during critical events.
Provide advanced troubleshooting expertise and guide teams during outages.
Conduct detailed postmortems, document lessons learned, and drive improvements to reduce Mean Time to Recovery (MTTR).
Collaborate with developers, QA, and product teams to embed reliability principles throughout the software development lifecycle.
Mentor peers on observability tools, performance optimization, and SRE best practices.
Identify opportunities for continuous improvement in reliability, performance, and cost efficiency.
Evaluate and recommend emerging technologies to enhance scalability and resilience.
Contribute to internal documentation, ensuring best practices are accessible across the organization.
Requirements
4+ years of experience in DevOps, Site Reliability Engineering, or a related role.
Proven track record as a technical lead or subject matter expert (no direct people management required).
Hands-on expertise with cloud platforms (AWS, Azure) and Infrastructure-as-Code (Terraform preferred).
Strong understanding of systems architecture, high availability, fault tolerance, and disaster recovery.
Experience leading incident response and conducting root cause analysis.
Familiarity with observability tools and performance optimization practices.
Strong collaboration and communication skills, with the ability to mentor peers and influence best practices.