Site Reliability Engineer contributing to platform reliability at Trainline, Europe's leading rail ticketing platform. Collaborating with product engineering to ensure operational readiness and incident response.
Responsibilities
Developing an understanding of system architecture, dependencies, and failure modes across the Trainline platform
Participating in production incident response, supporting investigation, mitigation, communication, and coordinated service restoration
Contributing to post-incident reviews and follow-up actions to improve reliability, scalability, and resilience
Taking part in the SRE on-call rotation
Designing, building, and maintaining observability using metrics, logs, events, and traces to support effective detection and diagnosis
Improving monitoring and alerting by aligning signals to business and customer impact, reducing noise and shortening mean time to detection (MTTD)
Ensuring relevant operational data is surfaced quickly and clearly during live incidents
Making informed tooling and technology choices using SRE principles, balancing team and business needs
Supporting AWS-hosted infrastructure and shared platform services using infrastructure-as-code and CI/CD tooling
Collaborating with product engineering teams to ensure services are operationally ready and deployed safely
Advising on reliability and resilience practices
Writing and maintaining reliable, well-structured code and scripts to support reliability and observability goals
Prioritising work effectively and collaborating using agile processes to deliver against team and business goals
Requirements
Experience with SRE concepts such as SLIs, SLOs, and error budgets
Hands-on experience with observability tooling such as New Relic, Elastic (ELK Stack), Influx, Grafana, or similar
Experience working with cloud providers (preferably AWS)
Experience troubleshooting Linux operating systems
Experience scripting in at least one language (preferably Python)
Understanding of load balancing and reverse proxy concepts, including upstream configuration, upstream health checks, and worker and data flow concepts
DevOps Engineer improving reliability and stability of cloud services at Madhive. Responsibilities include CI/CD tooling, monitoring, and cloud infrastructure management.
Senior DevOps Analyst at Stefanini managing Azure DevOps for build and deploy automation. Collaborating with development squads and ensuring code quality with validation tools.
Senior DevOps Engineer leading design and management of CI/CD pipelines at Neuron7.ai. Collaborating on cloud infrastructure for scalable applications in an innovative tech environment.
Backend Software Engineer responsible for building robust backend systems for AI and analytics products. Collaborating with various teams to enhance platform reliability and performance.
Senior DevOps Engineer responsible for cloud ecosystem architecture at a health-tech startup. Building HIPAA/GDPR-compliant foundations and mentoring developers.
Senior Backend Engineer building product features and maintaining infrastructure for an insurance platform. Employing tools like Terraform, Kafka, Datadog, and Qovery with a strong DevOps focus.
DevOps Systems Engineer supporting customer operations in Annapolis Junction, MD. Responsible for creating, sustaining, and troubleshooting complex operational data flows.
OpenShift Fresher assisting the Cloud team in managing containerized applications using Red Hat OpenShift. Supporting CI/CD, deployment automation, and cloud-native application environments.
Site Reliability Engineer for Leidos ensuring reliability, performance, and scalability of complex distributed systems for the Navy-Marine Corps Intranet. Collaborating with teams to maintain and optimize network operations and services.
DevOps Engineer evolving banking infrastructure for a fintech company. Focusing on observability, incident response, and platform automation in a hybrid work setup.