Site Reliability Engineer joining Spotify’s Backstage team to build intelligent infrastructure for the world’s most popular audio streaming service, contributing to AI-native workflows and developer experience.
Responsibilities
Orchestrate the Fleet: Maintain and improve Portal’s SaaS infrastructure for reliability, security, and scalability. This covers the runtime environments supporting the platform and workflows powered by large language models.
Modern Infra-as-Code: Collaborate with senior engineers to build infrastructure on GCP and AWS using Terraform and emerging infrastructure-from-code patterns where agents assist in defining the stack.
Support Fullstack Systems: Operate in a modern web stack environment (TypeScript, React, Python). While this isn’t a frontend-heavy role, comfort with debugging fullstack systems and web infrastructure is key.
Reliability Engineering: Participate in on-call rotations to ensure systems meet reliability and availability goals, employing AI assistants to accelerate root cause analysis and incident resolution.
Collaborate & Innovate: Participate in the planning and delivery of technical projects, defining how infrastructure evolves to support the next wave of generative AI features.
Requirements
Cloud Native & AI Curious: Brings hands-on experience with cloud infrastructure (GCP or AWS) and IaC tools like Terraform, with an interest in LLMs, RAG, or agents in an operational context.
Systems Thinker: Understands distributed systems principles and how to operate them reliably at scale, including the specific challenges posed by non-deterministic AI workloads.
Polyglot Practitioner: Experienced with at least one modern programming language (e.g., TypeScript, Java, Go, Python) and comfortable navigating codebases where AI-generated PRs are the norm.
Quality & Automation: Prioritizes code quality and reliability, looking for ways to build systems that test themselves and improve through automated feedback loops.
Growth Mindset: Eager to evolve as an engineer in a landscape where the definition of "operations" changes rapidly. Familiarity with open-source projects or building "coding assistant" bots is a plus.
Graduate Reliability Engineer at GKN Aerospace, enhancing operational excellence through data analysis and participation in projects on large structural assemblies.
Site Reliability Engineer at WRITER, ensuring 24/7 availability and performance of AI-powered workflows. Collaborating on scalable infrastructure solutions that directly affect enterprise customer trust.
Engineer at Trading Technologies improving platform stability through coding and automation. Focus on building advanced monitoring tools for global trading operations.
Senior MLOps/DevOps Engineer developing MLOps platform components at Capco Poland for digital transformation in financial services. Responsibilities include CI/CD, model deployment, monitoring, and team collaboration.
Senior DevOps Engineer at Verisk, focusing on AWS infrastructure and CI/CD pipeline automation. Ensuring high availability and security through collaboration with development and QA teams.
Senior DevOps & Infrastructure Engineer at IMAGO focusing on automation and infrastructure improvements. Building reliable infrastructure and leading CI/CD optimization in a dynamic environment.
DevOps Specialist creating and overseeing Azure hybrid cloud infrastructures for EVLO's battery energy storage solutions. Collaborating with teams to implement cutting-edge technologies in a dynamic environment.
Software Quality and Release Engineer developing and maintaining C++/Python software solutions for aerospace and defense industry. Collaborating on CI/CD automation and feedback documentation.
Senior DevOps Engineer building and managing big data platforms for clients in the telecommunications and finance industries. Ensuring stability, scalability, and performance across cloud and on-premises environments.