SRE role at BT Group focusing on cloud reliability and operational excellence across engineering teams. Collaborate with product owners to implement SRE principles for improved service performance.
Responsibilities
Partner with Product Owners and engineering leads to embed reliability into roadmaps, backlogs, and delivery decisions.
Apply SRE principles (SLIs, SLOs, error budgets) to maintain service reliability, performance, and scalability.
Enhance observability across metrics, logs, traces, and events to ensure services are observable by design.
Manage infrastructure as code and CI/CD environments, delivering improvements and supporting operational changes.
Lead incident response and root cause analysis, driving effective resolution, post-incident reviews, and long-term prevention.
Work with cross-functional engineering teams to remove technical barriers, reduce toil, and improve service operability.
Provide hands-on engineering support, validating technical decisions and promoting best practices.
Foster a culture of curiosity, experimentation, and first-principles thinking to strengthen engineering excellence.
Requirements
Deep understanding of SRE concepts: SLIs, SLOs, SLAs, and error budgets
Proven ability to design and implement reliable environments
Hands-on experience with monitoring tools, application insights, and integrations with tools such as Prometheus and Grafana
Infrastructure as Code skills (e.g. Terraform)
Advanced knowledge of VMware technology
Experience with CI/CD, automation and monitoring tools
Experience with disaster recovery planning and chaos engineering practices
Experience implementing identity governance and security frameworks
Benefits
Flexibility in working hours
Reasonable adjustments for the selection process if required