Onsite HPC and AI Cluster Engineer

Posted yesterday

About the role

  • The HPC and AI Cluster Engineer maintains large-scale HPC/AI clusters for NVIDIA's networking cluster solutions, engaging with researchers and developers to optimize workflows and deliver solutions.

Responsibilities

  • Deploy, manage, and maintain large-scale HPC/AI clusters
  • Manage Linux job/workload schedulers and orchestration tools
  • Support and maintain continuous integration and delivery pipelines
  • Troubleshoot and fix issues from the bottom up: bare metal, operating system, software stack, and application level
  • Support research and development activities and engage in POCs for future improvements

Requirements

  • Bachelor's Degree in Computer Science, Engineering, or a related field; or equivalent experience
  • 3+ years of experience
  • Knowledge of HPC and AI solution technologies, from CPUs and GPUs to high-speed interconnects and supporting software
  • Experience with job scheduling and workload orchestration tools such as Slurm and Kubernetes (K8s)
  • Excellent knowledge of Windows and Linux (Red Hat/CentOS and Ubuntu) networking (sockets, firewalls, iptables, Wireshark, etc.) and internals, ACLs, OS-level security protection, and common protocols such as TCP, DHCP, and DNS
  • Python programming and Bash scripting experience, plus experience with automation and configuration management tools and practices such as Jenkins, Ansible, and GitOps
  • Experience with virtualization platforms (for example VMware, Hyper-V, KVM)

Benefits

  • Competitive salaries
  • Extensive benefits package
  • Work environment that promotes diversity, inclusion, and flexibility

Job title

HPC and AI Cluster Engineer

Job type

Experience level

Mid level, Senior

Salary

Not specified

Degree requirement

Bachelor's Degree

Location requirements
