Site Reliability Engineer at Salesforce maintaining high uptime for cloud services. Collaborating with teams to automate issue resolution and improve operational excellence.
Responsibilities
Ensure 99.99% uptime for customer-facing services by proactively monitoring and maintaining the health of supporting systems, contributing directly to customer satisfaction and trust.
Act in key support roles during major incidents (e.g., Sev0, Sev1) and participate in technical incident reviews for problem management.
Contribute to Problem Management by populating and participating in Root Cause Analyses (RCAs) and handing them off to the Global Solutions team.
Ensure all work carried out by the Site Reliability team aligns with the company’s internal compliance policies and directives.
Collaborate with technical staff to solve complex technical issues and customer concerns.
Lead and mentor other team members in staying abreast of industry innovations and technologies, and support the team's development and growth.
Thrive in a fast-paced environment, solving sophisticated issues quickly and successfully balancing multiple priorities.
Automate the detection and resolution of recurring issues in the production environment.
Help create new processes and improve current ones to reduce operational and engineering toil, including the implementation of AI-driven automation for routine tasks.
Requirements
Citizenship: U.S. citizen (U.S. born or naturalized) who does not hold dual citizenship.
Education: Bachelor’s degree in Computer Science, Engineering, Information Technology, or a related technical field.
Experience: Systems engineering experience in an enterprise-scale internet service engineering or support role.
Technical Skills: Expertise in TCP/IP related technologies (networking protocols, network programming, etc.).
Expertise in CLI enterprise support of Unix variants (Linux/Solaris/BSD), with significant exposure to Red Hat Enterprise Linux and Solaris.
Strong understanding of monitoring security systems and administration.
Experience provisioning, operating, and running AWS/C2S based infrastructure and systems.
Proficiency in scripting with Python, Go, or other languages.
Communication: Strong written and oral communication skills.
Incident Management: Past experience in Incident Management and a good understanding of ITIL service operations.
Availability: Ability to participate in a 24/7 on-call rotation supporting large data center operations and be available for shift work.
Benefits
time off programs
medical
dental
vision
mental health support
paid parental leave
life and disability insurance
401(k)
employee stock purchase program
Job title
Site Reliability Engineer, GovCloud Incident Response