Site Reliability Engineer ensuring reliability and scalability of TEG's global live entertainment platforms. Collaborating with teams to enhance system reliability and prevent outages across ticketing platforms.
Responsibilities
Proactively guard the health, availability, and performance of TEG's critical global production systems.
Engineer and automate robust monitoring and auto-healing solutions to prevent outages and meet service level objectives (SLOs).
Drive Infrastructure-as-Code (IaC) principles for provisioning and deploying our highly available, scalable platforms.
Lead critical incident response efforts, ensuring rapid resolution and restoration of platform stability.
Provide technical leadership during major incidents, focusing on swift problem analysis and effective communication to stakeholders.
Transform incidents into progress by conducting deep post-mortems and driving the implementation of strategic preventative measures across various teams.
Build and maintain high-performing, fault-tolerant distributed systems emphasizing resiliency and efficiency.
Elevate operational maturity by continuously improving processes, tooling, and efficiency across the department.
Champion operational excellence and shared responsibility, collaborating with development and other teams to improve processes and tools.
Innovate system design by evaluating and integrating new technologies to enhance reliability, scalability, and security.
Mentor and coach colleagues, elevating the overall reliability engineering capability and maturity of the Technology department.
Requirements
Mastery of highly available, fault-tolerant AWS system design and management.
Strong foundation in AWS networking (VPC, Route 53) and security best practices.
Proficiency in key scripting languages (Python, Bash, PowerShell) for automation.
Proven ability to perform effectively under pressure, managing high-volume tasks and meeting tight deadlines.
Minimum of 3 years of prior SRE or DevOps experience.
Expert knowledge of fundamental infrastructure concepts (Networking, Containerisation, Virtualisation, DNS).
Working familiarity with key CI/CD and Infrastructure-as-Code tools (e.g., Terraform, Ansible, Jenkins).
Principal Site Reliability Engineer at Early Warning designing performance and resiliency patterns for applications and infrastructure. Collaborating with development teams to improve systems and data integrity.
DevOps Engineer contributing to CI/CD setup and Azure services management. Collaborates with teams to ensure efficient project delivery in a hybrid environment.
IT DevOps Specialist at BMW responsible for analyzing requirements and implementing software solutions in AWS cloud environments. Collaborating internationally within agile teams for digital transformation projects.
DevOps Engineer at Vistra designing, implementing, and maintaining robust CI/CD pipelines and cloud infrastructure. Enabling software delivery across multiple technology stacks with a focus on AWS.
Manage complex customer rollouts and initial system deployments at Talex.ai. Bridging technical development with real-world application in robotics and AI systems.
Cloud Operations Engineer designing and implementing highly reliable cloud solutions. Leading cloud infrastructure initiatives for production operations and customer success in a growing team.
Quality Engineer supporting new product launches and reliability testing for SSD at Micron in Malaysia. Responsible for coordinating test activities and conducting failure analysis.
Reliability Engineer ensuring operational readiness of data centers at Rowan Digital Infrastructure. Overseeing commissioning, operational standards, and transitioning facilities into live operations.
Manager of Mechanical Engineering ensuring high-availability mechanical systems in data centers. Collaborating on lifecycle management and performance evaluation across mission-critical facilities in a hybrid role.
DevOps Engineer at PLATH in Hamburg developing reusable Ansible and Puppet modules and managing CI/CD for project teams. Focusing on crisis detection software development.