Senior HPC and AI Cluster Administrator at NVIDIA specializing in high-performance computing infrastructure. Responsible for deploying and managing AI clusters while supporting R&D initiatives.
Responsibilities
Deploy, manage, and maintain large-scale HPC/AI clusters
Manage Linux job/workload schedulers and orchestration tools
Support and maintain continuous integration and delivery pipelines
Troubleshoot and fix issues from the bottom up, across the bare metal, operating system, software stack, and application levels
Support Research & Development activities and engage in POCs/POVs for future improvements
Requirements
Bachelor's Degree in Computer Science, Engineering, or a related field; or equivalent experience
5+ years of experience
Knowledge of HPC and AI solution technologies, from CPUs and GPUs to high-speed interconnects and supporting software
Experience with job scheduling and workload orchestration tools such as Slurm and Kubernetes (K8s)
Excellent knowledge of Windows and Linux (Red Hat/CentOS and Ubuntu) networking (sockets, firewalls, iptables, Wireshark, etc.) and internals, ACLs, OS-level security protection, and common protocols such as TCP, DHCP, and DNS
Experience with multiple storage solutions such as Lustre, GPFS, ZFS, and XFS
Familiarity with newer and emerging storage technologies
Experience with Python programming and Bash scripting, and with automation and configuration management tools and practices such as Jenkins, Ansible, and GitOps
Knowledge of networking protocols such as InfiniBand and Ethernet
Experience with virtual systems (for example VMware, Hyper-V, KVM)
Familiarity with cloud computing platforms (e.g. AWS, Azure, Google Cloud)
Benefits
Highly competitive salaries
An extensive benefits package
A work environment that promotes diversity, inclusion, and flexibility
AI Creator Coach working directly with students to ensure their success through accountability and guidance. Building trust-based relationships to enhance student outcomes and promote services.
Consultant Data & AI role in NORRIQ's Talent Academy focused on Microsoft technology stack and personal development. Engage with exciting projects and mentoring for career advancement.
Business Consultant at Consat Advisory helping clients with AI and digital transformations. Collaborating with management and cross-functional teams to improve efficiency and drive change.
Staff AI/ML Engineer at Welocalize leading complex machine learning projects and mentoring junior engineers. Focusing on ethical practices and driving innovation in AI and ML solutions.
AI Governance Lead ensuring ethical AI solutions and compliance in business travel. Collaborating with cross-functional teams to enhance AI governance frameworks.
Sales Specialist focused on developing AI demand generation initiatives and managing technology provider relationships within Sonda. Responsible for complex corporate sales cycles incorporating integrated solutions.
AI Conversation Designer optimizing voice AI agents for trade businesses to enhance customer interactions. Leading design, testing, and continuous improvement for effective AI conversations.
Physics AI Intern developing algorithms for Physics-Informed Machine Learning at RTX. Collaborating with leading researchers in aerospace to advance physics AI concepts.
PhD intern developing agentic workflows for bioinformatics in a London biotech team. Collaborating with scientists to integrate AI in bioinformatics and metabolomic chemistry applications.
Manager overseeing AI Risk Oversight team within Model Risk Office at Lloyds Banking Group. Conducting oversight activities for AI models and strengthening AI governance frameworks.