Senior Engineering Manager leading Data Center telemetry solutions at NVIDIA, driving architecture, development, and deployment for AI supercomputing platforms. Recruiting and managing top talent to optimize data center performance.
Responsibilities
Own the end-to-end architecture and delivery for telemetry solutions, including fleet health monitoring, fault remediation, and data visualization at scale
Own the out-of-band (OOB) telemetry solution and validate the telemetry data from each underlying device
Recruit, develop, and motivate a high-performing engineering team focused on platform telemetry, RAS, and observability
Continuously improve software development processes for optimal productivity and quality
Work across teams to ensure seamless integration of telemetry solutions with platform firmware, server architecture, and data center management
Drive product life cycles with QA teams, ensuring robust testing, productization, and delivery
Conduct performance reviews, foster a culture of excellence, and ensure high productivity
Requirements
12+ years of overall relevant experience
5+ years of managing systems/platform software teams
BS, MS, or PhD in EE/CS or related field (or equivalent experience)
Strong knowledge of DMTF/PLDM for OOB telemetry collection
Experience with time series databases (e.g., InfluxDB, Prometheus) and REST APIs (Redfish)
Deep understanding of server and firmware architecture and optimization for low-latency APIs
Proven track record of delivering scalable server products and telemetry solutions
Experience with SCM (Git, Perforce) and project management tools (Jira)
Hands-on experience with x86/ARM system architecture and coding (C/C++, Python)
Familiarity with Confidential Compute and notification systems
Demonstrated ability to analyze algorithms for time/space complexity and system resource requirements
Benefits
Equity
Benefits
Job title
Senior Manager, Engineering – Data Center Telemetry, RAS