Hybrid Principal Site Reliability Engineer

Posted last month

About the role

  • Work as a Site Reliability Engineer supporting Red Hat's software manufacturing services on hybrid cloud infrastructure, collaborating with development, quality engineering, and release engineering teams to maintain high reliability.

Responsibilities

  • Be part of a globally distributed team that offers 24x7 support through a service model using different time zones to extend coverage, with regular on-call rotations.
  • Resolve service incidents using existing operating procedures, investigate outage causes, and coordinate incident resolution across service teams.
  • Act as a leader and mentor to less experienced colleagues, propose and drive continuous improvement ideas, and help the team benefit from evolving technology, such as AI tools.
  • Collaborate on incident retrospective reviews and the implementation of corrective actions.
  • Configure and maintain service infrastructure.
  • Proactively identify and eliminate toil by automating manual, repetitive, and error-prone processes.
  • Coordinate with other Red Hat teams, such as IT Platforms, Infrastructure, Storage and Network, to ensure that the cloud deployment of our services meets quality expectations.
  • Implement monitoring, alerting, and escalation plans for infrastructure outages and performance problems.
  • Work with service owners to define and implement SLIs and SLOs for the services you'll support, ensure they are met, and execute remediation plans when they are not.

Requirements

  • Expert knowledge of OpenShift administration and application development
  • Linux administration expertise
  • Advanced knowledge of automation services: ArgoCD, Ansible or Terraform
  • Advanced knowledge of CI/CD platforms: Tekton and Pipelines as Code (optionally GitHub Actions or Jenkins)
  • Advanced knowledge and experience with monitoring platforms and technologies
  • General knowledge of AWS technologies
  • Ability to understand graphically represented concepts and architectures in documentation
  • Experience creating Standard Operating Procedures
  • Knowledge of open source monitoring technologies (Grafana, Prometheus, OpenTelemetry)
  • Excellent written and verbal communication skills in English
  • Previous experience with the SRE model (a plus)
  • Experience with software development using Python or Go (a plus)
  • Experience with automation design and implementation (a plus)

Benefits

  • Health insurance
  • 401(k) matching
  • Flexible work hours
  • Professional development opportunities

Job title

Principal Site Reliability Engineer

Job type

Experience level

Lead

Salary

Not specified

Degree requirement

Bachelor's Degree

Location requirements

Hybrid, Pune, India
