Join the Ellison Institute of Technology as a Senior ML Infrastructure Engineer. Build and operate high-performance ML infrastructure in Oxford to enable scientific breakthroughs.
Responsibilities
**Day-to-day, you might:**
Build, operate, and continuously optimise our high-performance GPU training and inference clusters, focusing on robust, high-availability scheduling, isolation, and automated lifecycle management.
Drive systems design and implementation for high-throughput data paths, optimising I/O, caching, and data locality across compute and storage (including our current Lustre implementation).
Proactively benchmark, profile, and resolve performance bottlenecks across the compute, network, and orchestration layers to maximise efficiency for distributed training and inference.
Establish comprehensive observability, resilience, and automated security controls to ensure compliance and robust operation of sensitive research environments.
Partner with Research, Data, and Applied teams to forecast capacity and cost for GPU and storage needs, setting quotas and streamlining ML experimentation pipelines.
Requirements
**What makes you a great fit:**
Proven experience leading the design, build, and operation of high-performance ML compute clusters at scale
A proactive, autonomous approach to systems design, with a proven ability and desire to ideate, co-create, and implement effective solutions
Exposure to migrating or transforming ML infrastructure from traditional schedulers to modern, containerised systems
Expertise with high-throughput storage systems for ML/HPC workloads
Expert-level understanding of GPU architecture, high-speed networking for distributed training, and performance profiling to resolve bottlenecks
A solid grasp of IaC and CI/CD practices (e.g., Terraform, Argo CD)
**It would also be great if you had:**
Experience with Lustre
Benefits
**We offer the following salary and benefits:**
Enhanced holiday pay
Pension
Life Assurance
Income Protection
Private Medical Insurance
Hospital Cash Plan
Therapy Services
Perkbox
Electric Car Scheme
**Why work for EIT:**
At the Ellison Institute, we believe a collaborative, inclusive team is key to our success. We are building a supportive environment where creative risks are encouraged, and everyone feels heard. We value emotional intelligence, empathy, respect, and resilience, and encourage people to be curious and to have a shared commitment to excellence. Join us and make an impact!