Senior Performance and Development Engineer at NVIDIA focusing on optimizing AI workloads and developing scalable AI infrastructure tools. Collaborating with a diverse team to enhance Deep Learning applications.
Responsibilities
Build AI models, tools, and frameworks that provide real-time application performance metrics that can be correlated with system metrics.
Develop automation frameworks that enable applications to predict and recover from system and infrastructure failures, ensuring fault tolerance.
Collaborate with software teams to pinpoint performance bottlenecks.
Design, prototype, and integrate solutions that deliver demonstrable performance gains in production environments.
Adapt and enhance communication libraries to seamlessly support innovative network topologies and system architectures.
Design or adapt optimized storage solutions to boost Deep Learning efficiency, resilience, and developer productivity.
Requirements
BS/MS/PhD (or equivalent experience) in Computer Science, Electrical Engineering or a related field.
12+ years of proven experience in analyzing and improving the performance of training applications using PyTorch or a similar framework.
Experience building distributed software applications using collective communication libraries such as MPI, NCCL, or UCC.
Experience building storage solutions for Deep Learning applications.
Experience building automated, fault-tolerant distributed applications.
Experience building tools for bottleneck analysis and automation of fault tolerance in distributed environments.
Strong background in parallel programming and distributed systems.
Experience analyzing and optimizing large-scale distributed applications.
Excellent verbal and written communication skills.