Senior Software Engineer, AI Resiliency (Hybrid)

Posted 3 weeks ago

About the role

  • NVIDIA is hiring a Senior Software Engineer to develop AI software resiliency for powerful AI supercomputers. The role leads efforts to improve reliability and robustness for large-scale AI workloads.

Responsibilities

  • Develop AI Software Resiliency Features: Implement and optimize software features that improve AI system reliability at massive scale, such as fast checkpoint recovery, error detection, error isolation, and straggler/hang detection.
  • Hands-On Coding & Optimization: Contribute to large-scale distributed systems with high-quality, production-level C++ and Python code.
  • Fault Tolerance & Debugging: Work on AI system error handling, implementing techniques to detect silent data corruption (SDC) and other failure scenarios.
  • Collaborate Across Teams: Work closely with senior engineers, AI researchers, and hardware/software teams to integrate resiliency features into AI frameworks like PyTorch and JAX/XLA.
  • Testing & Automation: Develop and implement tests to ensure robustness, scalability, and efficiency of resiliency mechanisms.
  • Support Production Deployments: Assist in debugging and performance-tuning large-scale AI training and inference workloads in cloud and HPC environments so that they run reliably.

Requirements

  • Bachelor’s, Master’s or PhD in Computer Science, Electrical Engineering, or a related field, or equivalent experience.
  • Proficiency in C++ and Python, with experience writing efficient, high-performance code.
  • 6+ years of relevant experience.
  • Strong understanding of distributed systems concepts, parallel programming, and fault tolerance in large-scale computing environments.
  • Familiarity with AI frameworks such as PyTorch, JAX/XLA, TensorFlow, or similar.
  • Experience with debugging and profiling tools (e.g., gdb, perf, valgrind, NVIDIA Nsight).
  • Excellent problem-solving skills and ability to work in a fast-paced, highly collaborative environment.

Benefits

  • Equity
  • Benefits

Job title

Senior Software Engineer, AI Resiliency

Experience level

Senior

Salary

$184,000 - $287,500 per year

Degree requirement

Bachelor's Degree
