Senior ML Infrastructure Engineer developing cloud and compute foundations at Ellison Institute of Technology. Focused on high-performance ML compute clusters to accelerate scientific breakthroughs.
Responsibilities
Build, operate, and continuously optimise our high-performance GPU training and inference clusters, focusing on robust, high-availability scheduling, isolation, and automated lifecycle management.
Drive systems design and implementation for high-throughput data paths, optimising I/O, caching, and data locality across compute and storage (including our current Lustre implementation).
Proactively benchmark, profile, and resolve performance bottlenecks across the compute, network, and orchestration layers to maximise efficiency for distributed training and inference.
Establish comprehensive observability, resilience, and automated security controls to ensure compliance and robust operation of sensitive research environments.
Partner with Research, Data, and Applied teams to forecast capacity and cost for GPU and storage needs, setting quotas and streamlining ML experimentation pipelines.
Requirements
Proven experience leading the design, build, and operation of high-performance ML compute clusters at scale
A proactive, autonomous approach to systems design and the proven ability and desire to ideate, co-create and implement optimal solutions
Exposure to migrating or transforming ML infrastructure from traditional schedulers to modern, containerised systems
Expertise with high-throughput storage systems for ML/HPC workloads
Expert-level understanding of GPU architecture, high-speed networking for distributed training, and performance profiling to resolve bottlenecks
A solid grasp of IaC and CI/CD practices (e.g., Terraform, Argo CD)
Infrastructure Engineer focusing on data encryption solutions for on-premise, mobile, and cloud environments. Collaborating with teams to design and build secure systems and solutions.
Senior Cloud Infrastructure Engineer developing and operating Kubernetes services for the CLOUD4EUROPE project. Supporting the team by creating solutions and participating in a 24x7 on-call schedule.
Cloud Infrastructure Engineer at Leonardo, focusing on cloud-based systems design and optimization. Responsible for ensuring security and performance in complex cloud architectures.
IT Consultant/IT Infrastructure Specialist advising clients on infrastructure projects, primarily in Microsoft/Active Directory, VMware, or Citrix. Analyzing and optimizing IT infrastructures through innovative solutions.
Senior Infrastructure Engineer responsible for managing and modernizing corporate infrastructure at ICEYE. Focused on enabling scalable global satellite operations and ensuring operational excellence.
Civil Design Engineer developing infrastructure for electric vehicle charging stations. Leading design processes and ensuring regulatory compliance for EV charger installations.
Infrastructure Engineer building an AI-powered platform for crisis management in Sweden. Collaborating with cross-functional teams to save lives through innovative tools and solutions.
Infrastructure Architect at Pague Menos responsible for designing secure, scalable IT infrastructure architectures. Collaborating with teams to implement and optimize both on-premise and cloud solutions.
IT Infrastructure Specialist managing physical and virtual server environments for Premier League Studios. Ensuring robust workflows and high-performance infrastructure in a hybrid work setting.
Manager of Platform Engineering at a leading insurance company shaping the future of API platforms. Fostering innovation and collaboration while driving platform stability and resiliency.