Site Reliability Engineer joining Spotify’s Backstage team, building intelligent infrastructure for the world's most popular audio streaming service. Contributing to AI-native workflows and developer experience.
Responsibilities
Orchestrate the Fleet: Maintain and improve Portal’s SaaS infrastructure for reliability, security, and scalability. This covers the runtime environments supporting the platform and workflows powered by large language models.
Modern Infra-as-Code: Collaborate with senior engineers to build infrastructure on GCP and AWS using Terraform and emerging infrastructure-from-code patterns where agents assist in defining the stack.
Support Fullstack Systems: Operate in a modern web stack environment (TypeScript, React, Python). While this isn’t a frontend-heavy role, comfort with debugging fullstack systems and web infrastructure is key.
Reliability Engineering: Participate in on-call rotations to ensure systems meet reliability and availability goals, employing AI assistants to accelerate root cause analysis and incident resolution.
Collaborate & Innovate: Participate in the planning and delivery of technical projects, defining how infrastructure evolves to support the next wave of generative AI features.
Requirements
Cloud Native & AI Curious: Brings hands-on experience with cloud infrastructure (GCP or AWS) and IaC tools like Terraform, with an interest in LLMs, RAG, or agents in an operational context.
Systems Thinker: Understands distributed systems principles and how to operate them reliably at scale, specifically addressing the challenges posed by non-deterministic AI workloads.
Polyglot Practitioner: Experienced with at least one modern programming language (e.g., TypeScript, Java, Go, Python) and comfortable navigating codebases where AI-generated PRs are the norm.
Quality & Automation: Prioritizes code quality and reliability, looking for ways to build systems that test themselves and improve through automated feedback loops.
Growth Mindset: Eager to evolve as an engineer in a landscape where the definition of "operations" changes rapidly. Familiarity with open-source projects or building "coding assistant" bots is a plus.
Site Reliability Engineer at Swiss Re designing and improving observability platforms. Involves collaborating with IT to ensure a seamless customer experience and reliable systems.
Lead DevSecOps Engineer responsible for secure cloud infrastructure at Swiss Re. Enhancing DevSecOps practices and collaborating within an agile environment.
DevOps Engineer responsible for designing and maintaining CI/CD pipelines at LUZA Group. Collaborating with teams on infrastructure automation using Terraform and Ansible.
Design and engineer wire harnesses for vehicles at Ford, ensuring quality and on-time delivery of components. Collaborate with engineering teams and suppliers to innovate and optimize designs.
Cloud Engineer joining Technology Operations at Wells Fargo, focusing on intelligent infrastructure solutions and Kubernetes platform optimization. Responsible for cloud-native deployments and AI-driven operations.
Principal DevOps Engineer/SRE leading DevOps initiatives for a multi-tenant SaaS platform. Designing standards and automation that empower product teams in operations and deployment.
Lead DevSecOps Engineer at McKesson driving cloud infrastructure and security initiatives. Focusing on GitHub workflows and Azure, mentoring team members on best practices.
DevOps Manager leading a team of engineers at FleetPartners, enhancing automation and overseeing cloud infrastructure. Working in a hybrid role to deliver optimized services and operational excellence.
DevOps Engineer supporting AT&T's Salesforce-based Business Sales application. Involved in CI/CD automation, cloud-native engineering, and collaboration across teams.
SRE Senior Engineer ensuring the reliability of large-scale distributed systems at Beyond Soluções. Overseeing data platform SLIs and SLOs while implementing automation and advanced observability.