Hybrid Site Reliability Engineer

About the role

  • As a Site Reliability Engineer, you will contribute to platform reliability at Trainline, Europe's leading rail ticketing platform, collaborating with product engineering to ensure operational readiness and effective incident response.

Responsibilities

  • Developing an understanding of system architecture, dependencies, and failure modes across the Trainline platform
  • Participating in production incident response, supporting investigation, mitigation, communication, and coordinated service restoration
  • Contributing to post-incident reviews and follow-up actions to improve reliability, scalability, and resilience
  • Taking part in the SRE on-call rotation
  • Designing, building, and maintaining observability using metrics, logs, events, and traces to support effective detection and diagnosis
  • Improving monitoring and alerting by aligning signals to business and customer impact, reducing noise and improving mean time to detection (MTTD)
  • Ensuring relevant operational data is surfaced quickly and clearly during live incidents
  • Making informed tooling and technology choices using SRE principles, balancing team and business needs
  • Supporting AWS-hosted infrastructure and shared platform services using infrastructure-as-code and CI/CD tooling
  • Collaborating with product engineering teams to ensure services are operationally ready and deployed safely
  • Advising on reliability and resilience practices
  • Writing and maintaining reliable, well-structured code and scripts to support reliability and observability goals
  • Prioritising work effectively and collaborating using agile processes to deliver against team and business goals

Requirements

  • Experience with SRE concepts such as SLIs, SLOs and error budgets
  • Hands-on experience with observability tooling such as New Relic, Elastic (ELK Stack), Influx, Grafana or similar
  • Experience working with cloud providers (preferably AWS)
  • Experience troubleshooting Linux operating systems
  • Experience of scripting in at least one language (preferably Python)
  • Understanding of load balancing and reverse proxy concepts, including upstream configuration, upstream health checks, and worker and data flow
  • Understanding of application architecture concepts (threading, queuing, readiness checks, health checks, circuit breakers, timeouts, exponential backoff, throttling)
  • Experience building, maintaining and evolving time series data, including retention, cardinality, deviation, moving averages and other functions
  • Experience with build, deployment and configuration management tooling such as GitHub Actions and Terraform

Benefits

  • Private healthcare and dental insurance
  • Generous work from abroad policy
  • 2-for-1 share purchase plans
  • EV Scheme to reduce carbon emissions
  • Extra festive time off
  • Excellent family-friendly benefits
  • Clear career paths
  • Transparent pay bands
  • Personal learning budgets
  • Regular learning days

Job title

Site Reliability Engineer

Experience level

Junior, Mid level

Salary

£55,000 - £63,000 per year

Degree requirement

Bachelor's Degree
