About the role

  • As a Senior ML Infrastructure Engineer at Ellison Institute of Technology, you will develop our cloud and compute foundations, focusing on high-performance ML compute clusters that accelerate scientific breakthroughs.

Responsibilities

  • Build, operate, and continuously optimise our high-performance GPU training and inference clusters, focusing on robust, high-availability scheduling, isolation, and automated lifecycle management.
  • Drive systems design and implementation for high-throughput data paths, optimising I/O, caching, and data locality across compute and storage (including our current Lustre implementation).
  • Proactively benchmark, profile, and resolve performance bottlenecks across the compute, network, and orchestration layers to maximise efficiency for distributed training and inference.
  • Establish comprehensive observability, resilience, and automated security controls to ensure compliance and robust operation of sensitive research environments.
  • Partner with Research, Data, and Applied teams to forecast capacity and cost for GPU and storage needs, setting quotas and streamlining ML experimentation pipelines.

Requirements

  • Proven experience leading the design, build, and operation of high-performance ML compute clusters at scale
  • A proactive, autonomous approach to systems design, with a proven ability and desire to ideate, co-create, and implement effective solutions
  • Exposure to migrating or transforming ML infrastructure from traditional schedulers to modern, containerised systems
  • Expertise with high-throughput storage systems for ML/HPC workloads
  • Expert-level understanding of GPU architecture, high-speed networking for distributed training, and performance profiling to resolve bottlenecks
  • A solid grasp of IaC and CI/CD practices (e.g., Terraform, Argo CD)

Benefits

  • Enhanced holiday pay
  • Pension
  • Life Assurance
  • Income Protection
  • Private Medical Insurance
  • Hospital Cash Plan
  • Therapy Services
  • Perkbox
  • Electric Car Scheme

Job title

Senior ML Infrastructure Engineer – AI

Experience level

Senior

Salary

Not specified

Degree requirement

No Education Requirement
