HPC scheduler/resource manager engineer crafting scheduling strategies for large datacenter clusters. Driving cutting-edge innovations in AI and GPU computing with top scientific partners and technologies.
Responsibilities
Provide engineering solutions and prototypes to enable efficient resource management and job scheduling for large scale clusters
Drive next-generation requirements and features for schedulers in at-scale clusters
Build and maintain technical relationships with internal and external engineering teams
Assist system architects and machine learning/deep learning engineers in building creative solutions based on NVIDIA technology
Be an internal reference for scheduling and resource management concepts and methodologies among the NVIDIA technical community
Test, evaluate, and benchmark new technologies and products, and work with vendors, partners, and peers to improve functionality and optimize performance
Requirements
BS, MS, or PhD in Engineering, Mathematics, Physics, Computer Science, or equivalent experience
12+ years of experience designing and running scheduling and resource management systems in large datacenter/AI/HPC solutions
Knowledge of and experience with resource management and scheduling code bases: SLURM preferred, or other implementations (LSF, SGE, Torque...)
Proven understanding of performance clusters, infrastructure and workload patterns
Experience using and installing Linux-based server platforms