Site Reliability Engineer at PayPal ensuring high availability, performance, and scalability for critical systems. Leading reliability initiatives, incident management, and system performance optimization.
Responsibilities
Take ownership of system performance monitoring, identify inefficiencies, and lead initiatives to improve the overall availability and reliability of digital platforms and applications
Lead and manage the response to complex, high-priority incidents, ensuring prompt resolution and a thorough root cause analysis to prevent future occurrences
Design and implement advanced automation frameworks to improve operational efficiency, streamline processes, and reduce human error
Lead reliability-focused initiatives, ensuring systems are highly available, resilient, and scalable, and promote best practices across engineering teams
Enhance the monitoring infrastructure by identifying key metrics, optimizing alerting, and improving system observability to ensure the reliability of large-scale systems
Forecast resource requirements and lead capacity planning activities to ensure systems can scale effectively to meet growing user demand
Maintain robust disaster recovery strategies and conduct regular testing to verify that systems can recover quickly from failures
Partner with engineering and product teams to identify opportunities for improving system architecture, focusing on scalability, reliability, and fault tolerance
Provide mentorship and technical guidance to junior site reliability engineers, fostering skill development and knowledge sharing
Drive continuous improvement across operational workflows, identifying areas for optimization, cost reduction, and performance enhancement
Requirements
3+ years in Cloud Infrastructure, Site Reliability Engineering (SRE), DevOps Engineering, or related fields
B.S. or M.S. degree in Computer Science, Engineering, or a related technical field; equivalent experience may be considered in lieu of a degree
2+ years of hands-on experience deploying, managing, and optimizing containerized applications using GKE and Harness in public and private cloud environments (AWS, GCP, Azure, etc.), preferably Google Cloud Platform (GCP)
2+ years of hands-on experience with infrastructure as code (Terraform, CloudFormation) and CI/CD pipelines (CircleCI, Harness, Jenkins, ArgoCD), plus programming experience in Node, Python, or Go
Strong working knowledge of Google Cloud Logging, Datadog, or other monitoring and observability tools
Ability to effectively diagnose and resolve performance bottlenecks within GCP at the infrastructure and application layers
Strong leadership abilities; must have customer focus and commitment to quality
Must have strong interpersonal skills and solid written and verbal communication skills
Ability to remain composed, methodical, and think fast in a high-pressure environment
Experience managing, collaborating with, and influencing global teams
Must be organized, detail-oriented, and able to manage multiple tasks simultaneously and prioritize them appropriately
Site Reliability Engineer Intern at Tencent working on gaming services and cloud native solutions. Collaborating with global teams to eliminate toil and enhance reliability.
Cloud/DevOps Specialist at N5X managing and optimizing critical cloud infrastructures for Brazilian energy trading. Collaborating with a multidisciplinary team to ensure high availability and performance.
Cloud/DevOps Specialist responsible for designing a hybrid architecture combining cloud and on-premises infrastructure for energy trading systems. Collaborating with a multidisciplinary team in a dynamic environment.
Reliability Engineering Specialist utilizing reliability tools and models to improve asset performance at Enbridge. Collaborating across teams to guide investment decisions for safe operations.
DevOps Engineer responsible for structuring and supporting cloud DevOps architecture in Brazil. Working strategically on automation and CI/CD practices with development teams in Pernambuco.
DevSecOps Software Engineer developing secure CI/CD pipelines for Boeing's military software systems. Collaborating with cross-functional teams and implementing automation and security best practices.
DevOps Manager responsible for managing a team delivering multi-cloud solutions in support of the USAF Cloud One project. Focused on scalable cloud-native solutions and CI/CD practices.
Lead Site Reliability Engineer overseeing SRE practices across Azure and GCP platforms. Driving reliability improvements and leading a team at Lloyds Banking Group.
DevOps Engineer responsible for managing Microsoft Intune operations at Bundesdruckerei GmbH. Focused on ensuring secure digital solutions for identity and data protection in Berlin.
Senior Site Reliability Engineer driving observability and reliability for business-critical systems at Incedo. Collaborating with engineering teams to enhance system resilience and performance.