Senior Site Reliability Engineer focused on building reliable, scalable infrastructure at a tech company. Driving best practices in observability, incident response, and engineering collaboration.
Responsibilities
Design, build, and maintain highly available, scalable, and fault-tolerant systems
Lead reliability improvements across production and non-production environments
Own and improve monitoring, alerting, and observability platforms
Drive incident response, root cause analysis, and post-incident reviews
Implement automation to reduce manual operational work
Partner with Engineering, Security, and Product to support platform needs
Establish and track SLIs, SLOs, and error budgets
Lead capacity planning and performance tuning efforts
Improve deployment, CI/CD, and infrastructure-as-code practices
Identify and mitigate reliability and scalability risks before they impact customers
Mentor and guide junior engineers and contribute to team technical standards
Participate in the on-call rotation and help mature on-call processes
Requirements
6+ years of experience in Site Reliability Engineering, DevOps, Platform Engineering, or related roles
Strong experience with cloud platforms (AWS, Azure, or GCP)
Proficiency with infrastructure as code (Terraform, CloudFormation, Pulumi, etc.)
Experience with containerization and orchestration (Docker, Kubernetes)
Strong Linux systems administration and networking fundamentals
Experience building and maintaining CI/CD pipelines
Hands-on experience with monitoring and observability tools (Datadog, Prometheus, Grafana, New Relic, etc.)
Strong troubleshooting and incident management skills
Experience with scripting and automation (Python, Bash, Go, or similar)
DevOps Engineer I provisioning scalable cloud infrastructure for Spring Health's mental health solutions. Focused on automation and reliability within engineering systems while collaborating closely with the team.
DevOps Developer responsible for maintaining CI/CD pipelines and automation processes. Collaborating with development teams for efficient application integration and quality software delivery.
Senior DevOps Engineer managing AWS and GCP infrastructure in a mission-driven healthcare company. Focus on automation, stability, and performance across cloud environments.
Engineering Manager leading the Site Reliability Engineering team at a fintech company. Ensuring the reliability, scalability, and performance of our digital banking platform.
Release Engineer with DevOps expertise in deployment and infrastructure at IGT. Ensuring fast, secure, repeatable deployments in a dynamic gaming environment.
Junior DevOps Engineer supporting design and maintenance of CI/CD pipelines in a hybrid work environment. Collaborating with teams to ensure cloud infrastructure reliability for AI solutions.
DevOps Engineer designing, implementing, and maintaining infrastructure for payment processing fintech Nuvei. Collaborating across teams for efficient software delivery and cloud resource management.
DevOps Engineer supporting IT platform development at K-tronik GmbH. Join a team focused on container-based customer platforms and CI/CD pipeline automation in a hybrid role.
Site Reliability Engineering Specialist ensuring service performance and reliability at BT Group. Driving automation and cloud solutions while mentoring a diverse team of engineers.