Product Manager guiding Health Automation and Resilience efforts for AI infrastructure at NVIDIA. Collaborating with engineering to develop fault detection and automated repair workflows.
Responsibilities
Establish the product vision and strategy for Health Automation and Resilience across DGX Cloud and partner GPU fleets.
Partner with engineering on the architecture and delivery of software agents, services, control loops, and distributed health components.
Convert hardware signals, telemetry pipelines, and operational insights into automation systems that reduce manual intervention.
Work with cloud providers and enterprise operators to understand failure modes and operational challenges.
Develop product specifications, technical requirements, and validation criteria for both internal and open-source components.
Support go-to-market activities including documentation, demos, partner enablement, and release readiness.
Track trends in observability, SRE practices, distributed systems, and automated operations to define long-term strategy.
Lead product technical reviews, customer conversations, and planning sessions.
Requirements
Bachelor’s degree in Computer Science, Engineering, or a similar area, or equivalent experience.
8+ years of relevant experience, including demonstrated leadership of technical products in cloud infrastructure, distributed systems, reliability engineering, or related fields.
Track record of defining multi-quarter strategy and leading execution across multiple engineering teams.
Ability to craft clear product requirements, work directly with engineering partners on technical decisions, and compose system-level workflows.
Strong architectural understanding of control planes, telemetry systems, health monitoring, repair workflows, or automated remediation systems.
Understanding of telemetry signals, SLOs, failure modes, and repair workflows in production environments.
Experience building automation, resilience, or failure-recovery capabilities for large-scale cloud or HPC environments.
Experience working with open-source technologies or products for software developers.
Excellent communication skills across engineering, customers, and executives.
Senior Manager responsible for Polaris integration product ownership at Pfizer, overseeing tech solutions for improved patient engagement and business growth.
Service Product Manager overseeing Routing Infrastructure Systems at HPE. Strategizing service portfolio and integrating Aruba and Juniper routing services into unified offerings.
Director of Product Management responsible for leading field technology improvements for GSK. Collaborating across departments to drive sales system effectiveness and user adoption.
Product Owner at ACA Group focusing on bridging back office operations and business agility. Managing the quote-to-cash landscape, including Salesforce.com and Workday, with an integrated vision.
Clinical Labs Product Management Intern at QuidelOrtho responsible for market research and product strategies while working in a hybrid role. Gaining hands-on experience in the medical device/biotech industry during a summer internship.
Intern supporting Global Product Management in medical diagnostics at QuidelOrtho. Analyzing market trends and assisting with product and marketing strategies while gaining hands-on experience.
Liaising between the BEEP team and IT to enhance PG&E’s Energy Insight system. Leading product management processes to support effective building electrification and energy efficiency programs.
Senior Manager, Product Management leading product strategy and execution at SOCi. Responsible for managing a team and collaborating cross-functionally to drive customer value and business growth.
Senior Product Manager shaping digital tools supporting fleet operations at Corpay. Leading product strategy and cross-functional collaboration to enhance customer experiences.