Senior ML Infrastructure Engineer developing cloud and compute foundations at Ellison Institute of Technology. Focused on high-performance ML compute clusters to accelerate scientific breakthroughs.
Responsibilities
Build, operate, and continuously optimise our high-performance GPU training and inference clusters, focusing on robust, high-availability scheduling, isolation, and automated lifecycle management.
Drive systems design and implementation for high-throughput data paths, optimising I/O, caching, and data locality across compute and storage (including our current Lustre implementation).
Proactively benchmark, profile, and resolve performance bottlenecks across the compute, network, and orchestration layers to maximise efficiency for distributed training and inference.
Establish comprehensive observability, resilience, and automated security controls to ensure compliance and robust operation of sensitive research environments.
Partner with Research, Data, and Applied teams to forecast capacity and cost for GPU and storage needs, setting quotas and streamlining ML experimentation pipelines.
Requirements
Proven experience leading the design, build, and operation of high-performance ML compute clusters at scale
A proactive, autonomous approach to systems design and the proven ability and desire to ideate, co-create and implement optimal solutions
Exposure to migrating or transforming ML infrastructure from traditional schedulers to modern, containerised systems
Expertise with high-throughput storage systems for ML/HPC workloads
Expert-level understanding of GPU architecture, high-speed networking for distributed training, and performance profiling to resolve bottlenecks
A solid grasp of IaC and CI/CD practices (e.g., Terraform, Argo CD)
Manager of Platform Engineering at a leading insurance company shaping the future of API platforms. Fostering innovation and collaboration while driving platform stability and resiliency.
Infrastructure Engineer responsible for building, monitoring, and securing IT infrastructure for NLACRC. Collaborating with IT personnel and external support to ensure robust infrastructure.
Infrastructure Engineering Intern working on cloud solutions at a global growth engine for commerce. Collaborating on secure, scalable systems and contributing to performance optimization.
Infrastructure Engineer supporting IT service management and implementing complex system solutions. Collaborating with business units and training junior team members in a hybrid environment.
Lead Infrastructure Engineer focusing on web access protection and security strategies at Lloyds Banking Group. Managing infrastructure improvements and team leadership in enterprise environments.
Infrastructure Engineering Lead overseeing edge security initiatives for Lloyds Banking Group. Driving the development of security capabilities and mentoring engineering teams.
Senior Infrastructure Engineer maintaining IT infrastructure and datacentre operations for Walkers Global. Installing, configuring, and troubleshooting various hardware and cloud services in a hands-on role.
Infrastructure Architect responsible for designing and implementing multi-cloud infrastructures. Collaborating with teams to ensure high availability, security, and cost efficiency in cloud environments.
Senior Database Administrator specializing in private cloud technologies for a fintech company's modernization agenda. Focused on database platform engineering with MS SQL and PostgreSQL.
Infrastructure Engineer II at Bank of America responsible for system engineering activities. Leading design, development, and implementation of complex infrastructure tools and services.