Site Reliability Engineer supporting Red Hat's software manufacturing services on hybrid cloud infrastructure. Collaborating with development, quality engineering, and release engineering to maintain high reliability.
Responsibilities
Be part of a globally distributed team that provides 24x7 support through a service model spanning multiple time zones, complemented by regular on-call rotations.
Resolve service incidents using existing operating procedures, investigate the causes of outages, and coordinate incident resolution across service teams.
Act as a leader and mentor to less experienced colleagues, propose and drive continuous improvement ideas, and help the team benefit from evolving technology, such as the use of AI tools.
Collaborate on incident retrospectives and the implementation of corrective actions.
Configure and maintain service infrastructure.
Proactively identify and eliminate toil by automating manual, repetitive, and error-prone processes.
Coordinate with other Red Hat teams, such as IT Platforms, Infrastructure, Storage, and Network, to ensure that our services' cloud deployments meet quality expectations.
Implement monitoring, alerting, and escalation plans for infrastructure outages and performance problems.
Work with service owners to define and implement SLIs and SLOs for the services you support, ensure they are met, and execute remediation plans when they are not.
Requirements
Expert knowledge of OpenShift administration and application development
Linux administration expertise
Advanced knowledge of automation tools such as ArgoCD, Ansible, or Terraform
Advanced knowledge of CI/CD platforms: Tekton and Pipelines as Code (experience with GitHub Actions or Jenkins is optional)
Advanced knowledge of, and experience with, monitoring platforms and technologies
General knowledge of AWS technologies
Ability to interpret architecture diagrams and other visual representations of concepts in documentation
Experience creating Standard Operating Procedures (SOPs)
Knowledge of open source monitoring technologies (Grafana, Prometheus, OpenTelemetry)
Excellent written and verbal communication skills in English
Previous experience with the SRE model (a plus)
Experience with software development in Python or Go (a plus)
Experience with automation design and implementation (a plus)
Senior DevOps Engineer managing DevOps processes and tooling for customer-facing platforms at Luminor. Building CI/CD pipelines and providing production support with a focus on mentoring and collaboration.
Building and maintaining DevOps processes and CI/CD pipelines at Luminor, a banking champion. Collaborating with international teams in a flexible work environment.
Senior DevOps Engineer at Luminor, a leading bank in the Baltics, managing customer-facing platforms and infrastructure. Building CI/CD pipelines and mentoring junior engineers.
Sr. Site Reliability Engineer designing and automating robust technical infrastructure at Broadridge. Collaborating across teams for successful deployment and operational support of services.
Senior Fleet Reliability Engineer maintaining high fleet uptime for autonomous vehicle technology. Collaborating with technical teams to ensure peak operational performance in data collection efforts.
DevOps Lead at Leidos managing platform engineering, SRE, and application security functions. Driving operational excellence and ensuring scalability for federal government applications.
SRE Lead developing scalable cloud-native solutions for mission-critical systems supporting the USAF. Managing teams, collaborating with cross-functional units, and ensuring high service reliability standards.
Junior DevOps / Platform Engineer at DieEnergiekoppler GmbH managing AWS/EKS platform operations. Collaborating with team members to improve platform functionality and security compliance.
DevOps Engineer responsible for AWS infrastructure and backend development at Allguth GmbH. Working on greenfield projects with modern solutions in a collaborative team.
Cloud DevOps Specialist responsible for building scalable infrastructure solutions in AWS at SONDA. Focusing on automation, containerization, and data management in a collaborative environment.