CentralReach seeks a Sr. Site Reliability Engineer to improve the reliability of its cloud platforms. Collaborate with software teams, implement modern reliability practices, and lead infrastructure improvements.
Responsibilities
Own availability, latency, performance, efficiency, monitoring/observability, emergency response, and capacity planning; set and maintain SLOs, SLIs, and error budgets; and create dashboards.
Analyze, troubleshoot, and resolve operational challenges contributing to defined SLOs.
Manage site stability, performance, reliability, and maintain uptime for production environments.
Develop a fully automated multi-environment observability stack based on the existing system and extend it to predict capacity needs from usage patterns.
Strive for automation to reduce toil and increase development velocity.
Perform application-specific production support, incident management, change management, problem management, RCAs, and service restoration as needed.
Identify changes to the product architecture from a reliability, performance, and availability perspective using a data-driven approach.
Document resolution run books and standard operating procedures.
Actively look for opportunities to improve the availability and performance of the system by applying lessons learned from monitoring and observation.
Collaborate with software development teams on the release management process, shape the future roadmap, and establish strong operational readiness across teams.
Implement reliability and observability tools (such as New Relic, Prometheus, and Grafana).
Collaborate with the Security team and other platform engineering teams to build reliable, maintainable, and scalable solutions that improve our security posture.
Requirements
Strong background as an SRE supporting a 24x7 highly available production environment for a SaaS or cloud service provider.
Solid experience with monitoring/APM/observability tools (Splunk, New Relic, etc.).
Experience implementing observability plans around logs, metrics, and traces.
Experience in an agile development team developing software.
Experience with cloud infrastructure environments, preferably AWS, and infrastructure as code (Terraform, CloudFormation).
Extensive experience with Docker, Kubernetes, Helm, CI/CD, and configuration management tools such as Ansible and Chef.
Strong experience with containerization technology and/or Kubernetes.
Experience with release automation, system administration, and configuration management.
Experience with programming languages (Java, Python, Go, etc.).
Strong understanding of Linux, Windows, software development, systems, networking, and cloud concepts.
Strong interpersonal and teamwork skills, including the ability to set and enforce process and to influence engineers who are not direct reports.
Strong analytical and programming skills (Python, Go, Java etc.).
Deep understanding of best practices for modern cloud security.
Proven experience building observability for security concerns, such as privilege escalations and bot detection.