Hybrid Site Reliability Engineer

Posted last week

About the role

  • Ensure proper monitoring, alerting, and observability across production and development environments
  • Collaborate with Product, Engineering, and IT Operations teams to identify and resolve issues affecting application performance and stability
  • Design and build self-service tools and automation to reduce manual operational work and improve response times
  • Participate in Change Management and Incident Review processes, contributing to root cause analysis and long-term fixes
  • Develop and enhance operational SLOs, SLIs, and SLAs in partnership with engineering teams
  • Automate scaling and recovery processes to improve system resilience
  • Support services before they go live through design reviews, capacity planning, and operational readiness assessments
  • Participate in a shared on-call rotation to ensure 24x7 production system reliability
  • Continuously evaluate and adopt emerging technologies to optimize performance, cost efficiency, and automation
  • Contribute to a healthy and collaborative engineering culture through documentation, mentorship, and teamwork

Requirements

  • Bachelor’s degree in Computer Science or related field, or equivalent professional experience
  • 3+ years of experience in a technical or operations engineering role in a highly regulated environment
  • Hands-on experience with cloud platforms, primarily AWS (EC2, RDS, Route 53, S3, ECS, Lambda, IAM, VPC, CloudFront); Azure or GCP experience is a plus
  • Proficiency in one or more scripting or programming languages: PowerShell, Python, Bash, C#, Golang, or TypeScript
  • Experience managing Windows Server and SQL Server environments; familiarity with Linux administration (Ubuntu)
  • Experience with Infrastructure as Code (IaC) tools like Terraform, Ansible, or CloudFormation
  • Knowledge of containerization, orchestration, and GitOps deployment technologies, such as Kubernetes and Argo CD
  • Familiarity with source control (Azure DevOps) and work management tools (Jira, Confluence)
  • Experience with monitoring, APM, and log aggregation tools such as Splunk, Prometheus, Grafana, Nagios, and CloudWatch
  • Familiarity with distributed tracing concepts and experience using OpenTelemetry to instrument, collect, and analyze telemetry data
  • Understanding of networking fundamentals, automation frameworks, and DevOps principles
  • Familiarity with AI tooling and its application in modern development environments to streamline coding and problem solving

Benefits

  • Competitive compensation and total rewards benefits
  • Comprehensive health, dental, and vision insurance
  • 401(k) with generous company match
  • Paid time off and holidays
  • Hybrid and remote work opportunities
  • Career growth and development support
  • Collaborative, team-oriented culture

Job title

Site Reliability Engineer

Experience level

Mid level, Senior

Salary

Not specified

Degree requirement

Bachelor's Degree
