Delivering against GCP and SRE Public Cloud technology roadmaps
Collaborating with engineering teams to release and evolve enterprise-class solutions
Managing operations of critical banking services, including 24x7 coverage via on-call rota
Enhancing resiliency and reliability of customer-facing services
Troubleshooting and diagnosing issues with an engineering mindset
Building tooling to support service reliability and code quality
Working across multiple labs and signature projects in the Digital space
Leading Chaos Engineering initiatives to stress test services
Requirements
Strong understanding of SRE & DevOps, including experience of Infrastructure as Code and CI/CD pipelines using tools such as Azure DevOps, Terraform, or Jenkins.
Proficiency with Incident Management software (ie ServiceNow)
Proficient in Dynatrace, Splunk, SRE GCP & Cloud Observability.
Demonstrable experience in using orchestrations tools such as Harness.
Knowledge of GCP and Azure cloud platforms.
Experience in identifying toil and design automated solutions to remove it.
Reliability & Performance Management: Design, implement and own the SLOs for critical platform services. Monitor system health, manage error budgets, and drive improvements in Mean Time to Failure (MTTF) and Mean Time to Recovery (MTTR).
Incident & Problem Management: Lead incident response and post-mortem analysis. Ensure root cause identification and long-term remediation strategies are implemented.
Platform Advocacy & Collaboration: Champion SRE principles across Segments & Propositions Lab. Collaborate with Lab Product Owners, Engineering Leads, and application teams to embed reliability into design and delivery.
Technical Leadership: Provide technical oversight across cloud infrastructure, CI/CD pipelines, observability tooling, and automation frameworks. Guide engineers in adopting scalable and resilient solutions.
Continuous Improvement: Identify and implement improvements in deployment, monitoring, and alerting processes. Drive automation to reduce toil and improve operational efficiency.
Governance & Compliance: Ensure platform services adhere to internal risk, security, and compliance standards. Support audit and regulatory reporting requirements.
Benefits
A generous pension contribution of up to 15%
An annual performance-related bonus
Share schemes including free shares
Benefits you can adapt to your lifestyle, such as discounted shopping
30 days’ holiday, with bank holidays on top
A range of wellbeing initiatives and generous parental leave policies
Platform Engineer focusing on supporting CI/CD pipelines and Kubernetes at PCCW. Responsible for ensuring platform services' reliability and performance, with night - time support as needed.
Site Reliability Engineer at Bumble optimizing large - scale Linux environments and ensuring system stability. Focusing on troubleshooting, incident recovery, and performance tuning in complex infrastructures.
Senior DevOps Manager overseeing CI/CD processes for NVIDIA Networking products. Leading a team and collaborating with global teams to enhance R&D efficiency and infrastructure.
DevOps Manager overseeing engineering team developing scalable CI/CD processes for NVIDIA Networking products. Enhancing global R&D efficiency in a technology - focused company.
Join Operations Team as Senior Site Reliability Engineer driving operational excellence for cybersecurity solutions. Collaborate across teams to manage production platforms and optimize infrastructure.
Software Developer - DevOps System Administrator working within the SCMT team to enhance software application efficiency. Collaborating on tools and scripts for application lifecycle management.
DevOps Engineer managing CI/CD pipelines and Kubernetes deployments at Stefanini. Collaborating with teams to optimize application health and deployment processes.
DevOps Engineer working with development teams for seamless feature integration and deployment automation. Focus on CI/CD pipelines, monitoring solutions, and continuous process optimization.
DevOps Network Administrator managing AWS cloud environments for mission - critical systems. Ensuring performance, scalability, and security while integrating modern DevOps and cybersecurity practices.