Senior Site Reliability Engineer at Salla, leading reliability initiatives and ensuring platform performance. The role involves handling incidents and mentoring engineers in building resilient systems.
Responsibilities
Lead reliability initiatives, handle complex incidents, improve platform performance, and guide engineering teams toward building resilient systems.
Participate in the **on-call rotation** as part of our commitment to platform reliability.
Troubleshoot complex issues across applications, infrastructure, and networks.
Identify and resolve performance bottlenecks and scaling challenges.
Enhance cloud-native infrastructure, deployment processes, and automation.
Build and refine dashboards, alerts, metrics, logs, and traces.
Develop tools that reduce operational toil and increase reliability.
Mentor engineers on reliability, debugging, and operational best practices.
Requirements
Strong experience with **Kubernetes**, **service mesh technologies**, and cloud platforms (AWS/GCP/Azure).
Deep understanding of **Linux**, networking, distributed systems, and load balancers.
Hands-on experience with **Terraform** or similar infrastructure-as-code (IaC) tools.
Experience with **Prometheus**, **Grafana**, **Loki**, **Mimir**, **Elastic**, or similar observability tools.
Proficiency in scripting/programming (Bash, Python, Go).
Experience with CI/CD and GitOps.
Strong debugging, incident response, and performance analysis skills.