About the role

As a Principal Site Reliability Engineer at Red Hat, you will manage the RHIVOS product SRE initiative, focusing on infrastructure reliability and continuous improvement and bringing deep technical expertise in engineering.

Responsibilities

Architect, design and lead the implementation of the RHIVOS product SRE initiative.
Instrument metrics to support Service Level Objectives (SLOs), Service Level Indicators (SLIs) and Service Level Agreements (SLAs) for critical services.
Use metrics designed and built into the software to analyze system performance, identify performance bottlenecks and underutilized hardware, and scale the infrastructure design where needed.
Review team contributions to software, correcting errors and providing constructive feedback.
Lead and participate in incident response and postmortems, and help identify steps to minimize Mean Time To Resolution (MTTR).
Regularly contribute to internal workshops and training to upskill the team as the product architecture evolves.
Configure and maintain software production infrastructure and tooling.
Serve as an internal expert on infrastructure and tooling, including software production pipelines, providing guidance to engineering teams and making high-level recommendations to improve efficiency, reliability, and stability.
Create and maintain service monitoring, improve automation, uphold security best practices, and respond to various service situations for the software production infrastructure.
Resolve service incidents using existing operating procedures, investigate outage causes, and coordinate incident resolution across various service teams.
Act as a leader and mentor to your less experienced colleagues, bring and drive continuous improvement ideas, and help the team benefit from technology evolution, such as the use of AI tools.
Collaborate on incident retrospective reviews and the implementation of corrective items.
Proactively identify and eliminate toil by automating manual, repetitive, and error-prone processes.
Coordinate your actions with other Red Hat teams such as IT and Product Security to ensure our infrastructure meets quality expectations.
Implement monitoring, alerting and escalation plans in the event of an infrastructure outage or performance problem.
Work with service owners to co-define and implement SLIs and SLOs for the services you’ll support, ensure they are met, and execute remediation plans if they are not.
Help out and back up the RHIVOS Raleigh lab SRE when needed.

Requirements

8+ years of software reliability engineering experience with deep expertise in Linux systems, infrastructure-as-code, and complex, distributed enterprise environments.
Linux administration expertise
Advanced experience with Kubernetes/OpenShift administration and application development
Advanced experience with automation services like Ansible or Terraform
Advanced experience with CI/CD platforms like GitLab CI, Tekton and Pipelines as Code (optionally GitHub Actions, etc.)
Advanced experience with monitoring platforms and technologies
Advanced experience with AWS technologies
Experience with open source monitoring technologies (Grafana, Prometheus, OpenTelemetry)
Excellent written and verbal communication skills in English, as you'll be working in a globally distributed team
Proven track record of leading, and implementing hands-on, the program- or product-wide adoption of a data-driven reliability framework by architecting complex, multi-service SLO/SLI standards and institutionalizing error budget policies that effectively balance rapid feature velocity with global system stability
Previous experience with the Site Reliability Engineer (SRE) model and with software development using Python or Go
Ability to work in the Raleigh office when needed

Benefits

Comprehensive medical, dental, and vision coverage
Flexible Spending Account - healthcare and dependent care
Health Savings Account - high deductible medical plan
Retirement 401(k) with employer match
Paid time off and holidays
Paid parental leave plans for all new parents
Leave benefits including disability, paid family medical leave, and paid military leave
Additional benefits including employee stock purchase plan, family planning reimbursement, tuition reimbursement, transportation expense account, employee assistance program, and more!

Hybrid Principal Site Reliability Engineer – Automotive

at Red Hat
