Join the Ellison Institute of Technology (EIT) in Oxford as a Senior ML Infrastructure Engineer. Build and operate the high-performance ML infrastructure that enables scientific breakthroughs.
Responsibilities
**Day-to-day, you might:**
Build, operate, and continuously optimise our high-performance GPU training and inference clusters, focusing on robust, high-availability scheduling, isolation, and automated lifecycle management.
Drive systems design and implementation for high-throughput data paths, optimising I/O, caching, and data locality across compute and storage (including our current Lustre implementation).
Proactively benchmark, profile, and resolve performance bottlenecks across the compute, network, and orchestration layers to maximise efficiency for distributed training and inference.
Establish comprehensive observability, resilience, and automated security controls to ensure compliance and robust operation of sensitive research environments.
Partner with Research, Data, and Applied teams to forecast capacity and cost for GPU and storage needs, setting quotas and streamlining ML experimentation pipelines.
Requirements
**What makes you a great fit:**
Proven experience leading the design, build, and operation of high-performance ML compute clusters at scale
A proactive, autonomous approach to systems design, with the proven ability and desire to ideate, co-create, and implement optimal solutions
Exposure to migrating or transforming ML infrastructure from traditional schedulers to modern, containerised systems
Expertise with high-throughput storage systems for ML/HPC workloads
Expert-level understanding of GPU architecture, high-speed networking for distributed training, and performance profiling to resolve bottlenecks
A solid grasp of IaC and CI/CD practices (e.g., Terraform, Argo CD)
**It would also be great if you had:**
Experience with Lustre
Benefits
**We offer the following salary and benefits:**
Enhanced holiday pay
Pension
Life Assurance
Income Protection
Private Medical Insurance
Hospital Cash Plan
Therapy Services
Perkbox
Electric Car Scheme
**Why work for EIT:**
At the Ellison Institute, we believe a collaborative, inclusive team is key to our success. We are building a supportive environment where creative risks are encouraged, and everyone feels heard. We value emotional intelligence, empathy, respect, and resilience, and encourage people to be curious and to have a shared commitment to excellence. Join us and make an impact!