Senior Software Engineer developing AI software resiliency for powerful AI supercomputers. Leading efforts to improve reliability and robustness for large-scale AI workloads at NVIDIA.
Responsibilities
Develop AI Software Resiliency Features: Implement and optimize software features that improve AI system reliability at a massive scale, such as fast checkpoint-recovery, error detection, error isolation, and straggler/hang detection.
Hands-On Coding & Optimization: Contribute to large-scale distributed systems with high-quality, production-level C++ and Python code.
Fault Tolerance & Debugging: Work on AI system error handling, implementing techniques to detect silent data corruption (SDC) and other failure scenarios.
Collaborate Across Teams: Work closely with senior engineers, AI researchers, and hardware/software teams to integrate resiliency features into AI frameworks like PyTorch and JAX/XLA.
Testing & Automation: Develop and implement tests to ensure robustness, scalability, and efficiency of resiliency mechanisms.
Support Production Deployments: Assist in debugging and performance tuning large-scale AI workloads in cloud and HPC environments, ensuring seamless operation of AI training and inference workloads.
Requirements
Bachelor’s, Master’s or PhD in Computer Science, Electrical Engineering, or a related field, or equivalent experience.
Proficiency in C++ and Python, with experience in writing efficient, high-performance code.
6+ years of relevant experience.
Strong understanding of distributed systems concepts, parallel programming, and fault tolerance in large-scale computing environments.
Familiarity with AI frameworks such as PyTorch, JAX/XLA, TensorFlow, or similar.
Experience with debugging and profiling tools (e.g., gdb, perf, valgrind, NVIDIA Nsight).
Excellent problem-solving skills and ability to work in a fast-paced, highly collaborative environment.
Senior Software Engineer developing high-performance diagnostic tools for NVIDIA’s networking platforms. Collaborating with teams for innovative solutions and ensuring hardware stability in high-performance computing environments.
Software Engineer designing and developing AI networking protocols for NVIDIA's cutting-edge technology. Collaborate with customers and handle all aspects of network driver development.
Software Developer Engineer in Networking at NVIDIA designing and verifying high-speed communication devices. Working closely with customers on product solutions across multiple platforms.
Full-Stack Developer responsible for developing features and improving processes at GovTech startup SUMM AI. Building AI solutions that create societal value in the public sector.
Senior Engineer developing AI tools for an early-stage startup in Munich. Expected to build AI agents and enhance frontend and backend applications while collaborating with the Co-Founder.
Senior Software Engineer at Anansi Solutions developing impactful client projects in a hybrid environment. Collaborating with teams and building internal tools while mentoring junior professionals.
Internship in System Integration & Deployment at Think3DDD focusing on Docker, Linux, and Cloud environments. Learning to deploy web systems and work with modern technologies in an innovative startup.
Senior Product Engineer at Replit leading initiatives for innovation in software creation platforms focused on next generation creators. Collaborating on disruptive projects in a high-visibility role.
Tech Lead in Applied Computer Vision Algorithms at Niantic Spatial. Driving innovations in geospatial AI and 3D reconstruction with a high-performance software team.
Senior Software Engineer developing scalable, high-quality software for Open Government Products. Engaging in cross-functional collaboration and driving public service innovations through technology.