Software Engineer at NVIDIA designing and operating exascale infrastructure for AI/ML systems. Collaborating with engineers and researchers to enhance AI research capabilities with cutting-edge technology.
Responsibilities
Design, develop, and operate distributed systems that manage data, compute, and networking for large-scale AI workloads.
Build software and automation to orchestrate workloads across thousands of GPUs and petabytes of storage in multi-region clusters.
Collaborate with AI/ML research teams to understand their requirements and translate them into scalable, high-performance solutions.
Drive improvements in system reliability, performance, and observability to meet exascale standards.
Partner with security, networking, and platform teams to ensure that MARS infrastructure meets the highest standards of robustness and compliance.
Participate in design reviews, contribute to system architecture discussions, and influence the evolution of NVIDIA’s AI infrastructure stack.
Stay current with advances in distributed systems, large-scale computing, and AI frameworks to help shape the future direction of MARS.
Requirements
BS or equivalent experience in Computer Science, Computer Engineering, or a related technical field.
8+ years of experience developing and operating large-scale distributed systems, infrastructure platforms, or HPC environments.
Strong programming skills in C++, Python, or Go, with proven experience designing production-quality software systems.
Solid understanding of distributed systems principles, data management, and large-scale orchestration frameworks.
Hands-on experience with high-performance storage (e.g., Lustre, GPFS, BeeGFS) and compute scheduling and orchestration (e.g., Slurm, Kubernetes, LSF).
Familiarity with cloud environments (Azure, AWS, GCP) and infrastructure automation tools.
Strong problem-solving skills, ownership mindset, and the ability to thrive in a fast-paced, collaborative environment.
Excellent communication skills and a track record of cross-functional collaboration.
Senior Database Engineer at Verizon responsible for SQL Server management and NoSQL migration. Involves production support, troubleshooting, and collaborating with application teams.
CitiRisk Credit Technology is seeking a Senior Vice President to lead architectural design and strategic implementation of software solutions. The position involves hands-on coding for more than 50% of the time.
Lead Software Engineer developing core components of high-performance applications for Morgan Stanley. Collaborating with cross-functional teams and enhancing existing components using modern Java practices.
Lead Full Stack Engineer at CoverGo managing the development lifecycle and AI integration in its SaaS platform. Overseeing team performance and driving innovative solutions in insurance technology.
Lead Full Stack Engineer at CoverGo overseeing development of insurance SaaS solutions. Mentoring engineering teams and collaborating with stakeholders to align technical solutions with business goals.
Software Developer at Kneat enhancing their paperless solutions through backend development and Elasticsearch proficiency. Collaborating with an Agile team in a fast-paced R&D environment.
Senior Software Developer - Backend specializing in Elasticsearch for Kneat's R&D team. Collaborating in an Agile environment to enhance the product suite and solve complex user problems.
Staff Backend Engineer at SafetyCulture responsible for technical direction of identity and access control systems. Leading architecture decisions and ensuring security for the cloud engineering team.
Back-end Software Engineer developing and enhancing clinical data repositories and APIs at Orion Health. Contributing to engineering best practices and mentoring junior engineers in a hybrid working environment.
Backend Developer at CI&T focusing on APIs and services for a leading Brazilian retailer. Responsible for backend solutions with a strong emphasis on security and integration.