Senior DevOps Service Reliability Engineer – DGX Cloud at NVIDIA

About the role

NVIDIA is hiring a Senior DevOps Engineer to ensure high service reliability and availability for its cloud products. The role involves collaborating with cross-functional teams and handling incident management in a 24/7 follow-the-sun environment.

Responsibilities

Partner with other key teams, including Site Reliability Engineering, the Security Operations Center, and DevOps teams
Help make services capable of providing near 100% availability
Decrease the frequency and duration of issues
Develop monitors, alarms, and alerts to help make the service more reliable
Report directly to a manager in the United States
Provide 24/7 service coverage in a follow-the-sun environment
Use alerts and alarms to help prevent issues and incidents when possible
Work with developers to develop and implement predictive support or diagnostic routines
Perform systems administration, network administration, and security incident monitoring tasks
Develop runbooks which the entire team will use
Update and evolve the runbooks as needed
Discover incidents and issues, and initiate the incident management procedure
Provide feedback that helps continually improve the service

Requirements

5+ years of experience administering large-scale production systems
3+ years of experience in high-availability Internet, Cloud, or Data Center environments (Systems Administration, SRE, or NOC)
BS in Computer Science, Engineering, Physics, Mathematics, or equivalent experience
Expert-level knowledge of Linux system administration and automation using Ansible and/or Python
Strong experience with shell scripting, DNS, DHCP, storage systems, and core networking (iptables, routing, firewalls)
Experience with at least one workload manager (Slurm preferred) or job scheduling system in a production environment
Strong experience troubleshooting and maintaining large-scale bare-metal infrastructure
Strong cross-team collaboration, documentation, and mentoring skills
Experience improving processes for automation, reliability, and operational excellence
Expertise using monitoring tools and problem ticketing systems
Strong problem-solving, analytical, and troubleshooting abilities

Benefits

equity
benefits
