Senior Systems Engineer responsible for HPC cluster management and optimization at Rackspace. The role involves collaborating with scientists and providing technical support for high-performance computing.
Responsibilities
Install, configure, and maintain HPC clusters (hardware, software, operating systems).
Perform regular updates and patching, and manage user accounts and permissions.
Troubleshoot and resolve hardware and software issues.
Monitor and analyze system and application performance, identify bottlenecks, and implement tuning solutions.
Manage job scheduling and resource allocation with schedulers such as Slurm and LSF, and manage clusters with tools such as Bright Cluster Manager, OpenHPC, and Warewulf.
Configure Linux networking (TCP/IP, DNS, routing) and HPC interconnects (InfiniBand, Ethernet).
Implement and maintain large-scale storage and parallel file systems (Lustre, Ceph, GPFS), ensuring data integrity and managing backups.
Implement security controls and manage authentication services like LDAP and Active Directory.
Automate deployments and system configurations using tools like Ansible, Terraform, Jenkins, and Git.
Provide technical support, documentation, and training to researchers and collaborate with scientists and HPC architects.
Requirements
Bachelor’s degree in Computer Science, Engineering, or a related field (equivalent experience may substitute for degree).
Minimum of 10 years of systems experience, including at least 5 years working specifically with HPC.
Strong knowledge of Linux operating systems (e.g., Rocky Linux, Ubuntu) with a fundamental understanding of Linux internals, system administration, and performance tuning.
Experience building and managing RPM and DEB packages.
Experience with cluster management tools such as Bright Cluster Manager, OpenHPC stack, or Warewulf.
Proficiency with job schedulers and resource managers such as Slurm and LSF.
Strong understanding of Linux networking (e.g., TCP/IP, DNS, routing) and HPC interconnects (e.g., InfiniBand, Ethernet) including performance tuning.
Knowledge of parallel file systems such as Lustre, Ceph, or GPFS.
Working knowledge of Linux authentication and directory services such as LDAP and Active Directory.
Proficiency in scripting languages (e.g., Python, Bash, R); familiarity with MPI libraries for parallel and distributed computing is nice to have.
Strong experience with DevOps and configuration management tools, including Ansible, Terraform, Jenkins, and Git.
Knowledge of HPC in cloud environments (e.g., AWS, Azure, GCP HPC offerings) is a plus.
Strong knowledge of Linux security, compliance standards, and data protection best practices.
Excellent communication, interpersonal, and problem-solving skills.