HPC Architect designing high-performance computing solutions at Applied Materials. Focused on optimizing compute, storage and networking for semiconductor manufacturing processes.
Responsibilities
architect high-performance computing solutions from scratch
design and optimize all aspects (compute, memory, networking, storage) for lower cost of ownership
design HPC infrastructure solutions, including compute, networking, storage, and workload management components
work closely with cross-functional teams, including hardware, software, product management, and business stakeholders
create and maintain detailed system architecture diagrams and specifications
evaluate and select appropriate hardware and software components for HPC environments
install, configure, and maintain HPC systems, including hardware, software, and networking components
develop and implement automation scripts for system management and deployment
act as the subject matter expert in the HPC domain to unblock dependent teams
develop system benchmarks, profile systems to identify bottlenecks, and optimize workflows and processes to reduce cost of ownership
identify and mitigate technical risks and issues throughout the HPC development life cycle
ensure that the compute cluster is resilient, reliable, and maintainable
stay abreast of the latest HPC technologies, including hardware, software, and networking solutions
understand the compute workload and design the HPC cluster with the right combination of nodes, CPUs/GPUs, memory, interconnects, and storage to achieve optimum performance at minimum cost of ownership
Requirements
In-depth experience with Linux system administration and hardware/software configuration
Strong knowledge of HPC technologies, including cluster computing, high-speed interconnects (InfiniBand, RoCE), and parallel filesystems (Lustre, GPFS, BeeGFS, etc.)
Experience creating and maintaining operating system images with different installation and boot schemes
Highly proficient with automation tools such as Ansible, Chef, and SaltStack, and with scripting languages (Python and Bash)
Experience creating and maintaining storage solutions with different RAID configurations
Ability to design storage solutions for different IOPS requirements and access patterns (random vs. sequential read/write), and to tune storage and filesystems for better performance
Good knowledge of networking concepts, including IP addressing, routing, protocols, switch configuration for RDMA, VLAN configuration, and network bonding
Good knowledge of virtualization, including hardware and software hypervisors
Good knowledge of containerization technologies such as Docker and Singularity
Experience with software-defined networking and storage
Experience setting up remote management protocols such as IPMI and Redfish
Experience setting up and using monitoring systems such as Prometheus and Grafana
Experience with system profiling and custom tuning for target workloads to achieve higher performance and lower cost of ownership
Very good written and verbal communication skills
Very good at writing technical documentation meant to serve as manuals for non-experts in the field
Experience with HPC cluster management and workload orchestration software (e.g. SLURM, Torque, LSF)
Experience setting up deep learning training/inference solutions
Experience with private cloud infrastructure such as Kubernetes, OpenStack, and CloudStack
Experience with distributed high-performance computing and parallel programming frameworks
Good knowledge of low-latency, high-throughput data transfer technologies (RDMA over InfiniBand, RoCE)
Benefits
supportive work culture that encourages you to learn, develop, and grow your career
commitment to providing programs and support that encourage personal and professional growth
health and wellbeing programs
Job title
Principal Software Architect – High-Performance Computing