About the role

As a Site Reliability Engineer, you will deliver AWS managed services support with a focus on system stability and automation, collaborating with teams to resolve complex operational challenges. This is a hybrid role.

Responsibilities

Deliver tier two cloud operations managed services support for AWS environments
Provide 24x7x365 tier two support and escalation handling for AWS environments
Execute complex operational tasks, including:
Patching and managing Amazon Machine Images (AMIs)
Creating and configuring EC2 instances and RDS databases
Managing IAM roles, users, and policies
Configuring S3 bucket policies and Access Control Lists (ACLs)
Opening and managing network routes
Restoring snapshots and database backups to lower environments
Increasing disk sizes and managing storage optimization
Implementing proper tagging for environment identification and cost allocation
Managing log archiving and retention policies
Handle escalations from tier one support with deep technical analysis
Provide root cause analysis for complex incidents and recurring issues
Implement and maintain Service Level Indicators (SLIs) and Service Level Objectives (SLOs)
Lead tier two incident response, performing advanced troubleshooting and resolution
Conduct thorough post-incident analysis with actionable remediation plans
Reduce reactive work by improving runbooks, alert configurations, and standard operating procedures
Apply reliability engineering best practices with oversight and review
Mentor tier one engineers during incident response
Build and maintain CI/CD pipelines for infrastructure and application deployments
Automate complex operational tasks including patching, backups, and environment provisioning
Develop infrastructure automation using Terraform or equivalent IaC tools
Create sophisticated scripts and tooling to eliminate manual toil and improve operational efficiency
Follow established patterns and contribute continuous improvements
Document automation processes for knowledge sharing
Deploy and operate containerized workloads using Docker on AWS services (ECS, EKS, or other managed container platforms)
Support container reliability through proper health checks, autoscaling configurations, and resource management
Implement safe deployment patterns (canary deployments, blue/green deployments)
Troubleshoot complex containerization and orchestration issues
Configure and maintain comprehensive monitoring, logging, and alerting systems
Leverage observability data to identify issues and lead root cause analysis
Contribute to performance tuning and cost optimization initiatives
Ensure proper instrumentation and telemetry across AWS environments
Identify patterns and trends to prevent future incidents
Build custom dashboards and reports for operational insights
Work closely with customer development and operations teams
Participate in design reviews and reliability assessments
Communicate technical concepts, tradeoffs, and recommendations clearly to stakeholders
Provide regular operational updates and service reports
Act as technical liaison between customers and internal engineering teams

Requirements

4 to 8 years of experience in DevOps, SRE, or production operations roles
Proven experience operating production systems in AWS environments
Demonstrated experience managing containerized applications in production
Experience delivering managed services or supporting customer-facing infrastructure
Track record of handling complex technical escalations
Strong working knowledge of EC2, RDS, S3, IAM, VPC, CloudWatch, and related services
Hands-on experience with Docker and container orchestration platforms (ECS, EKS, or managed Kubernetes)
Proficiency with Terraform or equivalent IaC tools
Experience building and maintaining automated deployment pipelines
Proficiency in Python, Go, Bash, or similar languages
Experience with observability tools (CloudWatch, Datadog, Splunk, ELK, or similar)
Proficiency with Git and collaborative development workflows
Advanced diagnostic and problem-solving capabilities
Experience with 24x7 operations and tier two escalation support

Benefits

Health insurance
Paid time off
Flexible work arrangements

Hybrid Senior Associate Cloud SRE

at Datavail

