Sr. Staff Production Engineer at Zscaler implementing scalable multi-cloud infrastructure. Leading automation efforts and managing incident responses within a global platform.
Responsibilities
Design and implement highly available, scalable infrastructure across AWS, Azure, GCP, and bare-metal environments
Drive an "automation-first" culture by writing code (Python/Go) to eliminate manual toil and build self-healing systems
Implement and maintain sophisticated observability (Prometheus, Grafana, OpenTelemetry), define SLIs/SLOs, and establish error budgets
Act as a lead Incident Commander (TDO on-call), develop response playbooks, and conduct deep-dive post-incident analyses
Collaborate with Engineering and partner teams to conduct operability reviews
Requirements
8+ years of experience managing reliability, scalability, and availability for large-scale production services
Deep expertise in programming (e.g., Python, Go, or C/C++)
Strong background in networking protocols, Linux/FreeBSD systems, and distributed architecture
Experience in high-stakes incident management and participation in a 24/7 on-call rotation
Proficiency in leveraging ITIL frameworks and incident data to drive service maturity through systematic problem management and technical operability reviews
Production Engineer ensuring availability of applications in distributed environments for Consort Group. Collaborating on technical projects and maintaining operational quality across services.
Site Reliability Engineer ensuring stability and security for ShiftKey’s Marketplace platform while executing AWS migration. Blends maintenance with engineering in a collaborative environment.
Production Engineer designing customer-oriented manufacturing concepts at Festo. Responsibilities include process development, documentation review, and collaboration with international teams.
Experienced Production Engineer supporting quality-critical processes and collaborating with teams to ensure high-quality pen needles. Engaging in stable operations and improvements within a 2-year temporary contract.
Production Support Engineer ensuring system stability and reliability for Manulife's critical services. Collaborative role bridging development and infrastructure, providing seamless service for customers.
Senior Production Engineer (SRE) at Legion building and operating a secure AWS/Kubernetes platform. Focused on automation, reliability, and infrastructure as code.
Production Engineer managing database operations at Palantir, ensuring reliability and availability of data systems. Involved in architecture, design, and maintenance of production databases in various environments.
Production Engineer PCB managing first-line technical support for PCB assembly processes. Assisting with product introduction and implementing process improvements in a leading transport solutions company.
Senior Production Support / DevOps Engineer at Keyrus focusing on application reliability and cloud operations. Supporting enterprise Java-based platforms in collaboration with development teams.