Senior Specialist – Cloud SRE (Hybrid)

Posted last week

About the role

  • As a Senior Site Reliability Engineer, you will improve the reliability and performance of business-critical services in multi-cloud environments spanning AWS, Azure, and GCP. You will collaborate with engineering teams to drive automation and measurable outcomes.

Responsibilities

  • Reliability Engineering & SRE Practices
  • Define, implement, and maintain Service Level Indicators (SLIs), Service Level Objectives (SLOs), and error budgets for critical services.
  • Continuously monitor SLO compliance and drive improvements based on error budget consumption.
  • Participate in architecture reviews focused on high availability, disaster recovery, scalability, and fault tolerance.
  • Lead incident response, acting as the Tier-3 escalation point for SRE and operations teams.
  • Drive blameless postmortems and Root Cause Analysis (RCA), and ensure corrective and preventive actions are implemented.
  • Define and maintain incident response runbooks, escalation paths, and on-call processes.
  • Track and improve key reliability metrics including MTTR, incident frequency, and change failure rate.
  • Automate infrastructure provisioning and operational workflows using Terraform, CloudFormation, and AWS CDK.
  • Build and maintain CI/CD pipelines supporting canary deployments, blue/green strategies, and automated rollbacks.
  • Implement event-driven automation and auto-remediation using AWS Lambda, Step Functions, or Azure Functions.
  • Continually identify and eliminate operational toil through automation and self-healing systems.
  • Design, implement, and operate end-to-end observability platforms covering metrics, logs, and traces.
  • Ensure alerts are SLO-driven, actionable, and noise-free.
  • Provision and manage cloud infrastructure across AWS, Azure, and/or GCP.
  • Operate compute, storage, networking, load balancers, VPNs, and private connectivity.
  • Manage patching, backups, encryption, IAM/RBAC, and disaster recovery readiness.
  • Optimize performance and cost through rightsizing, autoscaling, and capacity planning.

Requirements

  • 8–10 years of experience in SRE, Cloud Engineering, or Production Operations roles.
  • Strong OS fundamentals in Linux and Windows, including scripting in Bash and PowerShell.
  • Strong programming skills in Python, Go, or equivalent.
  • Proven hands-on experience with:
      • Infrastructure as Code (Terraform, CloudFormation, CDK)
      • CI/CD pipelines and deployment automation
      • Observability tools (New Relic, Datadog, Prometheus, Grafana, Graylog, ELK)
      • Distributed systems at production scale
  • Cloud certifications (one or more):
      • AWS (Associate or Professional)
      • Azure (AZ-104 / Architect Expert)
      • GCP (Professional Cloud Architect)
  • Cloud-agnostic certification such as Terraform Associate, CKA, or SRE Foundation.

Job title

Senior Specialist – Cloud SRE

Job type

Experience level

Senior

Salary

Not specified

Degree requirement

Bachelor's Degree

Location requirements
