Technical Program Manager driving AI infrastructure with external partners at NVIDIA. Collaborating with engineering and infrastructure teams to enhance AI capacity and management.
Responsibilities
As a DGX Cloud Technical Program Manager, you'll be a key partner to our Engineering, Infrastructure, and Software teams and their leadership, driving critical programs related to AI capacity enablement and management.
You'll play a pivotal role in developing and maturing foundational capabilities and processes for DGX Cloud, spanning critical areas such as cluster/capacity bring-up including CPU, storage, networking and compute requirements to support GPUs.
This is a dynamic, fast-paced environment where TPMs are expected to apply fungible skillsets to a range of high-impact programs across DGX Cloud.
Collaborating closely with storage engineering and network engineering teams to define and communicate requirements to CSPs (Cloud Service Providers) and NCPs (NVIDIA Cloud Providers).
Drive alignment on a plan of record (POR) for capacity blocks based on workload needs.
Drive early engagement with CSPs (Cloud Service Providers) and NCPs (NVIDIA Cloud Providers) to understand their managed storage and network solutions and influence alignment with the NVIDIA Cloud roadmap.
Gathering technical requirements, developing comprehensive roadmaps, establishing clear milestones, and ensuring adherence to our Product Lifecycle (PLC) process.
Managing ongoing capacity operations and the engineering engagement with CSP (Cloud Service Provider) and NCP (NVIDIA Cloud Provider) partners, collaborating closely with an SRE lead.
Focus on availability, maintenance and other critical performance indicators.
Partner closely within NVIDIA to understand workload requirements and related HW and infra needs, including speeds and feeds, to optimize infrastructure readiness with CSPs (Cloud Service Providers) and NCPs (NVIDIA Cloud Providers).
Leveraging Jira and other program management platforms to instill rigor and structure in the management of engineering deliverables.
Identifying and driving opportunities to adopt third-party and in-house cloud infrastructure solutions for deployment, support, security, compliance, and observability across DGX Cloud.
Establishing key performance indicators (KPIs) and quantitatively demonstrating the value and impact delivered by your programs.
Proactively identifying, resolving, and mitigating risks and issues that could affect scope, schedule, and quality across all program aspects.
Cultivating a culture of continuous improvement, consistently identifying opportunities for process enhancements within our cloud infrastructure operations.
Requirements
12+ years of technical program management experience, specifically driving the planning and execution of large-scale cloud infrastructure programs with external partners, with a strong focus on software engineering projects within a matrixed organization.
Extensive hands-on experience in cloud infrastructure, preferably gained from working at a major Cloud Service Provider (CSP).
Domain knowledge in the bring-up and end-to-end operations of compute, storage, networking, and GPUs (including common failure points at the HW and SW levels).
Expert-level proficiency with Jira, Smartsheet, or similar program management tools, with the ability to confidently guide engineering teams on their use of the tools.
Exceptional strategic and tactical thinking abilities, coupled with a strong capacity to build consensus and drive program success.
Comfort and effectiveness working in ambiguous environments.
Possess excellent communication and technical presentation skills, particularly for executive audiences.
BS or MS in Electrical Engineering or Computer Science, or equivalent experience.