Hybrid Senior DevOps Service Reliability Engineer – DGX Cloud

Posted 5 days ago

About the role

  • partner with other key teams, including Site Reliability Engineering, the Security Operations Center, and DevOps
  • help services achieve near-100% availability
  • decrease the frequency and duration of issues
  • develop monitors, alarms, and alerts to help make the service more reliable
  • report directly to a manager in the United States
  • provide 24/7 service coverage in a follow-the-sun environment
  • use alerts and alarms to help prevent issues and incidents when possible
  • work with developers to develop and implement predictive support or diagnostic routines
  • perform systems administration, network administration, and security incident monitoring tasks
  • develop runbooks which the entire team will use
  • update and evolve the runbooks as needed
  • discover incidents and issues, and initiate the incident management procedure
  • provide feedback that helps us continually improve our service

Requirements

  • 5+ years of experience administering large-scale production systems
  • 3+ years of experience in high-availability Internet, Cloud, or Data Center environments (Systems Administration, SRE, or NOC)
  • BS in Computer Science, Engineering, Physics, Mathematics, or equivalent experience
  • Expert-level knowledge of Linux system administration and automation using Ansible and/or Python
  • Strong experience with shell scripting, DNS, DHCP, storage systems, and core networking (iptables, routing, firewalls)
  • Experience with at least one workload manager (Slurm preferred) or job scheduling system in a production environment
  • Strong experience troubleshooting and maintaining large-scale bare-metal infrastructure
  • Strong cross-team collaboration, documentation, and mentoring skills
  • Experience improving processes for automation, reliability, and operational excellence
  • Expertise using monitoring tools and problem ticketing systems
  • Strong problem-solving, analytical, and troubleshooting abilities

Benefits

  • equity
  • benefits

Job title

Senior DevOps Service Reliability Engineer – DGX Cloud

Experience level

Senior

Salary

$144,000 - $270,250 per year

Degree requirement

Bachelor's Degree
