Hybrid Product Manager, Health Automation, Resilience

Posted 3 weeks ago

Apply now

About the role

  • Product Manager guiding Health Automation and Resilience efforts for AI infrastructure at NVIDIA. Collaborating with engineering to develop fault detection and automated repair workflows.

Responsibilities

  • Establish the product vision and strategy for Health Automation and Resilience across DGX Cloud and partner GPU fleets.
  • Partner with engineering on the architecture and delivery of software agents, services, control loops, and distributed health components.
  • Convert hardware signals, telemetry pipelines, and operational insights into automation systems that reduce manual intervention.
  • Work with cloud providers and enterprise operators to understand failure modes and operational challenges.
  • Develop product specifications, technical requirements, and validation criteria for both internal and open-source components.
  • Support go-to-market activities including documentation, demos, partner enablement, and release readiness.
  • Track trends in observability, SRE practices, distributed systems, and automated operations to define long-term strategy.
  • Lead product technical reviews, customer conversations, and planning sessions.

Requirements

  • Bachelor’s degree in Computer Science, Engineering, or a similar area, or equivalent experience.
  • 8+ years of relevant experience including demonstrated experience leading technical products within cloud infrastructure, distributed systems, reliability engineering, or related fields.
  • Track record defining multi-quarter strategy and leading execution with multiple engineering teams.
  • Ability to craft clear product requirements, work directly with engineering partners on technical decisions, and compose system-level workflows.
  • Strong architectural understanding of control planes, telemetry systems, health monitoring, repair workflows, or automated remediation systems.
  • Understanding of telemetry signals, SLOs, failure modes, and repair workflows in production environments.
  • Experience building automation, resilience, or failure-recovery capabilities for large-scale cloud or HPC environments.
  • Experience working with open-source technologies or products for software developers.
  • Excellent communication skills across engineering, customers, and executives.

Benefits

  • Equity
  • Benefits

Job title

Product Manager, Health Automation, Resilience

Job type

Experience level

SeniorLead

Salary

$168,000 - $258,750 per year

Degree requirement

Bachelor's Degree

Location requirements

Report this job

See something inaccurate? Let us know and we'll update the listing.

Report job