Build, operate, and continuously optimise our high-performance GPU training and inference clusters, focusing on robust, high-availability scheduling, isolation, and automated lifecycle management.
Drive systems design and implementation for high-throughput data paths, optimising I/O, caching, and data locality across compute and storage (including our current Lustre implementation).
Proactively benchmark, profile, and resolve performance bottlenecks across the compute, network, and orchestration layers to maximise efficiency for distributed training and inference.
Establish comprehensive observability, resilience, and automated security controls to ensure compliance and robust operation of sensitive research environments.
Partner with Research, Data, and Applied teams to forecast capacity and cost for GPU and storage needs, setting quotas and streamlining ML experimentation pipelines.
Requirements
**What makes you a great fit:**
Proven experience leading the design, build, and operation of high-performance ML compute clusters at scale
A proactive, autonomous approach to systems design, with the proven ability and desire to ideate, co-create, and implement optimal solutions
Exposure to migrating or transforming ML infrastructure from traditional schedulers to modern, containerised systems
Expertise with high-throughput storage systems for ML/HPC workloads
Expert-level understanding of GPU architecture, high-speed networking for distributed training, and performance profiling to resolve bottlenecks
A solid grasp of IaC and CI/CD practices (e.g., Terraform, Argo CD)
**It would also be great if you had:**
Experience with Lustre
Benefits
**We offer the following salary and benefits:**
Enhanced holiday pay
Pension
Life Assurance
Income Protection
Private Medical Insurance
Hospital Cash Plan
Therapy Services
Perkbox
Electric Car Scheme
**Why work for EIT:**
At the Ellison Institute, we believe a collaborative, inclusive team is key to our success. We are building a supportive environment where creative risks are encouraged, and everyone feels heard. We value emotional intelligence, empathy, respect, and resilience, and encourage people to be curious and to have a shared commitment to excellence. Join us and make an impact!