Principal Site Reliability Engineer at Red Hat managing the RHIVOS product SRE initiative. Focusing on infrastructure reliability and continuous improvement with deep technical expertise in engineering.
Responsibilities
Architect, design and lead the implementation of the RHIVOS product SRE initiative.
Instrument metrics to support Service Level Objectives (SLOs), Service Level Indicators (SLIs) and Service Level Agreements (SLAs) for critical services.
Use the metrics designed and built into the software to analyze system performance, identify performance bottlenecks and underutilized hardware, and scale the infrastructure design.
Review team contributions to the software, correct errors and provide constructive feedback.
Lead and participate in incident response and postmortems, and help identify steps to minimize Mean Time To Resolution (MTTR).
Regularly contribute to internal workshops and training to upskill the team as the product architecture evolves.
Configure and maintain software production infrastructure and tooling.
Serve as an internal expert on infrastructure and tooling, including software production pipelines, providing guidance to engineering teams and making high-level recommendations to improve efficiency, reliability, and stability.
Create/maintain service monitoring, improve automation, uphold security best practices and respond to various service situations for the software production infrastructure.
Resolve service incidents using existing operating procedures, investigate outage causes and coordinate incident resolution across various service teams.
Act as a leader and mentor to your less experienced colleagues, bring and drive continuous improvement ideas and help the team benefit from evolving technology, such as the use of AI tools.
Collaborate on incident retrospective reviews and the implementation of corrective items.
Proactively identify and eliminate toil by automating manual, repetitive, and error-prone processes.
Coordinate your actions with other Red Hat teams such as IT and Product Security to ensure our infrastructure meets quality expectations.
Implement monitoring, alerting and escalation plans in the event of an infrastructure outage or performance problem.
Work with service owners to co-define and implement SLIs and SLOs for the services you’ll support, ensure those are met and execute remediation plans if they are not.
Help out and back up the RHIVOS Raleigh lab SRE when needed.
Requirements
8+ years of software reliability engineering experience with deep expertise in Linux systems, infrastructure-as-code, and complex, distributed enterprise environments.
Linux administration expertise
Advanced experience with Kubernetes/OpenShift administration and application development
Advanced experience with automation services like Ansible or Terraform
Advanced experience with CI/CD platforms like GitLab CI, Tekton and Pipelines as Code (optionally GitHub Actions, etc.)
Advanced experience with monitoring platforms and technologies
Advanced experience with AWS technologies
Experience with open source monitoring technologies (Grafana, Prometheus, OpenTelemetry)
Excellent written and verbal communication skills in English, as you'll be working in a globally distributed team
Proven track record of leading and implementing, hands-on, the program- or product-wide adoption of a data-driven reliability framework: architecting complex, multi-service SLO/SLI standards and institutionalizing error budget policies that balance rapid feature velocity with global system stability
Previous experience with the Site Reliability Engineering (SRE) model and software development using Python or Go.
Ability to work in the Raleigh office when needed
Benefits
Comprehensive medical, dental, and vision coverage
Flexible Spending Account - healthcare and dependent care
Health Savings Account - high deductible medical plan
Retirement 401(k) with employer match
Paid time off and holidays
Paid parental leave plans for all new parents
Leave benefits including disability, paid family medical leave, and paid military leave
Additional benefits including employee stock purchase plan, family planning reimbursement, tuition reimbursement, transportation expense account, employee assistance program, and more!