Architect, build, and operate our cloud platform, moving infrastructure beyond the initial setup to deliver resilient compute, network, and storage, including full-sized GPU clusters
Drive the implementation of highly structured, auditable delivery pipelines (CI/CD/GitOps) to enforce automated, repeatable infrastructure changes
Design and deploy automated governance and security controls using Policy-as-Code (specifically Kyverno and YAML) to ensure strong isolation, protect data, and meet internal audit standards
Establish the foundational monitoring, alerting, and telemetry framework required for robust operations, defining clear SLOs, and setting the course for future SRE work
Partner with Research and Data teams to build self-service capabilities that efficiently support diverse workloads, from Python notebooks to distributed clusters
Requirements
**What makes you a great fit:**
Proven experience in platform engineering, with a demonstrable track record of architecting and automating operational processes
A highly proactive attitude and a passion for introducing and automating operational structure
Expertise with at least one major cloud provider (OCI, AWS, GCP, or Azure)
Proficiency with Terraform for declarative, large-scale infrastructure provisioning
Comfortable operating and managing large-scale, resilient Kubernetes clusters
Proficiency in at least one major language for system-level tools (e.g. Python, Go, or Java) with some scripting experience
**It would also be great if you had:**
Familiarity with modern Policy-as-Code tooling
A passion for introducing and automating operational rigour and structure
Experience supporting ML and Data Engineering workloads
Benefits
**We offer the following salary and benefits:**
Enhanced holiday pay
Pension
Life Assurance
Income Protection
Private Medical Insurance
Hospital Cash Plan
Therapy Services
Perkbox
Electric Car Scheme
**Why work for EIT:**
At the Ellison Institute, we believe a collaborative, inclusive team is key to our success. We are building a supportive environment where creative risks are encouraged, and everyone feels heard. We value emotional intelligence, empathy, respect, and resilience, and encourage people to be curious and to have a shared commitment to excellence. Join us and make an impact!