Senior DevOps Engineer for NVIDIA's cloud products ensuring high service reliability and availability. Collaborating with cross-functional teams and handling incident management in a 24/7 follow-the-sun environment.
Responsibilities
partner with other key teams, including Site Reliability Engineering, the Security Operations Center, and DevOps
help make services capable of providing near 100% availability
decrease frequency and duration of any issue
develop monitors, alarms, and alerts to help make the service more reliable
report directly to a manager in the United States
provide services 24/7 in a follow-the-sun environment
use alerts and alarms to help prevent issues and incidents when possible
work with developers to develop and implement predictive support or diagnostic routines
perform systems administration tasks, network administration tasks, and security incident monitoring
develop runbooks which the entire team will use
update and evolve the runbooks as needed
discover incidents and issues, and initiate the incident management procedure
provide feedback that helps us continually improve our service
Requirements
5+ years of experience administering large-scale production systems
3+ years of experience in high-availability Internet, Cloud, or Data Center environments (Systems Administration, SRE, or NOC)
BS in Computer Science, Engineering, Physics, Mathematics, or equivalent experience
Expert-level knowledge of Linux system administration and automation using Ansible and/or Python
Strong experience with shell scripting, DNS, DHCP, storage systems, and core networking (iptables, routing, firewalls)
Experience with at least one workload manager (Slurm preferred) or job scheduling system in a production environment
Strong experience troubleshooting and maintaining large-scale bare-metal infrastructure
Strong cross-team collaboration, documentation, and mentoring skills
Experience improving processes for automation, reliability, and operational excellence
Expertise using monitoring tools and problem ticketing systems
Strong problem-solving, analytical, and troubleshooting abilities
Benefits
equity
benefits
Job title
Senior DevOps Service Reliability Engineer – DGX Cloud
Sr. Site Reliability Engineer designing and automating robust technical infrastructure at Broadridge. Collaborating across teams for successful deployment and operational support of services.
Senior Fleet Reliability Engineer maintaining high fleet uptime for autonomous vehicle technology. Collaborating with technical teams to ensure peak operational performance in data collection efforts.
DevOps Lead at Leidos managing platform engineering, SRE, and application security functions. Driving operational excellence and ensuring scalability for federal government applications.
SRE Lead developing scalable cloud-native solutions for mission-critical systems supporting USAF. Managing teams, collaborating with cross-functional units, and ensuring high service reliability standards.
Junior DevOps / Platform Engineer at DieEnergiekoppler GmbH managing AWS/EKS platform operations. Collaborating with team members to improve platform functionalities and security compliance.
DevOps Engineer responsible for AWS infrastructures and backend development at Allguth GmbH. Engaging in greenfield projects with modern solutions in a collaborative team.
Cloud DevOps Specialist responsible for building scalable infrastructure solutions in AWS at SONDA. Focusing on automation, containerization, and data management in a collaborative environment.
DevOps Engineer maintaining and evolving deployment pipelines for Docebo’s AI-powered learning platform. Collaborating with cross-functional teams to ensure efficient software releases and infrastructure management.
DevOps Engineer optimizing CI/CD pipelines for Docebo, an AI-powered learning platform. Involves managing multi-tenant infrastructure using AWS, Docker, and Kubernetes.
DevOps Engineer maintaining and automating infrastructure and CI/CD processes for cybersecurity solutions by NordLayer. Collaborating with teams to ensure performance and scalability of cloud services.