Site Reliability Engineer contributing to platform reliability at Trainline, Europe's leading rail ticketing platform. Collaborating with product engineering to ensure operational readiness and incident response.
Responsibilities
Developing an understanding of system architecture, dependencies, and failure modes across the Trainline platform
Participating in production incident response, supporting investigation, mitigation, communication, and coordinated service restoration
Contributing to post-incident reviews and follow-up actions to improve reliability, scalability, and resilience
Taking part in the SRE on-call rotation
Designing, building, and maintaining observability using metrics, logs, events, and traces to support effective detection and diagnosis
Improving monitoring and alerting by aligning signals to business and customer impact, reducing noise and improving mean time to detection (MTTD)
Ensuring relevant operational data is surfaced quickly and clearly during live incidents
Making informed tooling and technology choices using SRE principles, balancing team and business needs
Supporting AWS-hosted infrastructure and shared platform services using infrastructure-as-code and CI/CD tooling
Collaborating with product engineering teams to ensure services are operationally ready and deployed safely
Advising on reliability and resilience practices
Writing and maintaining reliable, well-structured code and scripts to support reliability and observability goals
Prioritising work effectively and collaborating using agile processes to deliver against team and business goals
Requirements
Experience with SRE concepts such as SLIs, SLOs, and error budgets
Hands-on experience with observability tooling such as New Relic, Elastic (ELK Stack), Influx, Grafana, or similar
Experience working with cloud providers (preferably AWS)
Experience troubleshooting Linux operating systems
Experience scripting in at least one language (preferably Python)
Understanding of load balancing and reverse proxy concepts, including upstream configuration, upstream health checks, and worker and data flow
DevOps Engineer III providing L3 support for Operations across edge/on-prem and cloud environments. Building automations and handling incidents for customer deployments.
SRE leading reliability and operational excellence at a mortgage tech platform. Designing systems, tooling, and processes for managing Pylon's production systems in Palo Alto.
Senior Build & Release Engineer at GXO Logistics responsible for CI/CD solutions and build automation across various environments. Collaborating with teams for smooth software deployments and mentoring staff.
Senior Site Reliability Engineer improving the reliability of Acuity’s cloud services. Collaborating across teams to define observability standards and incident response in Cork Digital Centre of Excellence.
Azure Senior DevOps Engineer supporting critical cloud systems in the Azure Government Cloud environment. Leading CI/CD pipeline design and implementation with operational best practices.
Automation Engineer enhancing infrastructure and automating operations for client systems. Working in a complex environment oriented towards automation, security, and performance.
Graduate Reliability Engineer at GKN Aerospace enhancing operational excellence through data analysis and project participation within large structural assemblies.
Site Reliability Engineer at WRITER, ensuring 24/7 availability and performance of AI-powered workflows. Collaborating on scalable infrastructure solutions while impacting enterprise customer trust.
Engineer at Trading Technologies improving platform stability through coding and automation. Focus on building advanced monitoring tools for global trading operations.