Senior ML Infrastructure Engineer developing cloud and compute foundations at Ellison Institute of Technology. Focused on high-performance ML compute clusters to accelerate scientific breakthroughs.
Responsibilities
Build, operate, and continuously optimise our high-performance GPU training and inference clusters, focusing on robust, high-availability scheduling, isolation, and automated lifecycle management.
Drive systems design and implementation for high-throughput data paths, optimising I/O, caching, and data locality across compute and storage (including our current Lustre implementation).
Proactively benchmark, profile, and resolve performance bottlenecks across the compute, network, and orchestration layers to maximise efficiency for distributed training and inference.
Establish comprehensive observability, resilience, and automated security controls to ensure compliance and robust operation of sensitive research environments.
Partner with Research, Data, and Applied teams to forecast capacity and cost for GPU and storage needs, setting quotas and streamlining ML experimentation pipelines.
Requirements
Proven experience leading the design, build, and operation of high-performance ML compute clusters at scale
A proactive, autonomous approach to systems design, with the proven ability and desire to ideate, co-create, and implement optimal solutions
Exposure to migrating or transforming ML infrastructure from traditional schedulers to modern, containerised systems
Expertise with high-throughput storage systems for ML/HPC workloads
Expert-level understanding of GPU architecture, high-speed networking for distributed training, and performance profiling to resolve bottlenecks
A solid grasp of IaC and CI/CD practices (e.g., Terraform, Argo CD)
Infrastructure Systems Engineer II managing production application support for Conduent. Collaborating on ITIL processes and incident management while working in a 24/7 environment.
OT Cybersecurity Specialist responsible for secure IT-OT infrastructures in industrial operations. Engaging in secure deployments, integrating cybersecurity frameworks, and providing expert support.
Infrastructure and Security Engineer collaborating on the design of secure architectures at CRG Solutions. Integrating cybersecurity best practices and managing incidents in Windows and Linux environments.
Senior Infrastructure Engineer managing global IT infrastructure for aviation solutions, focusing on VMware, Nutanix, and Windows Server environments. Collaborating with teams to ensure high availability and optimal performance in a hybrid work model.
Cloud Support Engineer maintaining operational stability and automation for Azure cloud platforms. Working collaboratively across IT teams to ensure infrastructure reliability and security.
Database Engineer at Aircall building tooling for database management and observability. Working in a fast-paced environment for an innovative customer communications platform.
Lead Cloud Infrastructure Engineer at Paramount managing cloud architecture and infrastructure initiatives across environments. Involved in automation, scalability, and mentoring infrastructure engineers.
Senior Infrastructure Engineer specializing in Cisco and VMware to modernize hybrid environments for strategic partners. Ownership and mentorship role within a collaborative IT team.
Data Cloud & Infrastructure Architect connecting BigQuery potential with Salesforce execution. Mastering identity resolution and driving real-time data orchestration in a hybrid environment.
Infrastructure Engineer developing infrastructure technology for public and private cloud environments. Complying with security and operational requirements while using automation to enhance product testing.