Senior Site Reliability Engineer improving software performance and technical operations for a workflow builder startup. Collaborating with teams on infrastructure, scalability, and developer experience.
Responsibilities
Monitoring our core business-logic software, both on call and in non-urgent situations: describing its existing behavior and defining SLOs or SLAs that prompt all of us, you included, to respond.
Extending and monitoring our infrastructure stack.
Maintaining both a mental and a reified model of our systems: from this model, estimating risk, planning projects, and debugging efficiently.
Collaborating with our engineering team and company leadership, bringing your unique lens on site reliability.
Working on our core orchestration logic that determines how to efficiently run tens of thousands of workflows at the same time.
Advising on many major backend engineering projects running in parallel, from early planning through release.
Optimizing our services for scalability, stability, and observability as our customer base grows and our product becomes more sophisticated.
Improving our developer experience in tactical ways, and improving our overall engineering processes and practices more broadly.
Requirements
5+ years of SRE, DevOps, or Platform engineering experience.
A proven record of building efficient, performant, and easy-to-extend systems.
Experience maintaining quantitative metrics of site reliability, along with demonstrated judgment about appropriate strictness for SLOs and SLAs.
Familiarity with Linux and with containerization and orchestration tools (e.g., Docker, Kubernetes), including how they interact with backend services.
Experience implementing and managing AWS infrastructure.
You’re not afraid to ask for help, and you’re happy to give it, too.
You’re an enthusiastic communicator and you like working with a team that provides both mutual support and thoughtful critique.
You’re excited to join a hybrid team and work out of our NYC or SF office ~3 days a week.
Senior Reliability Engineer at Sonova ensuring dependable performance of hearing solutions for millions of users globally. Involves engineering skills to improve product reliability across development stages.
Equipment and Reliability Engineer at Chobani responsible for improving asset efficiency, redesigning equipment. Collaborating with Operations to solve complex problems and lead projects in a team environment.
Reliability Engineer II focused on enhancing safety, efficiencies, and cost controls at Freeport-McMoRan mining operations. Collaborating with multiple teams and managing engineering projects.
Reliability Engineer I responsible for equipment failure analysis and improvement recommendations at Freeport-McMoRan's copper smelting operations. Ensuring uninterrupted production and managing equipment health through data analysis.
Designing, building, and maintaining the Kubernetes-based developer platform for Schwarz IT Barcelona. Collaborating with engineering teams to enhance services in Azure and Google Cloud.
Database Reliability Engineer managing MySQL database infrastructure at PointClickCare. Collaborating with Engineering and SRE teams for product development and reliable integration across the platform.
Team lead for building cleaning services in Grimma, responsible for planning, organizing, and leading the cleaning team. Hands-on participation and adherence to hygiene and quality standards are required.
Service Reliability Engineer providing technical support and managing incidents for BT International. Ensuring system availability and collaboration with global stakeholders to achieve objectives.
Studying Bachelor of Arts in Accounting, Taxation, and Economic Law while gaining practical experience in a dynamic team. Benefit from a diverse working day and continuous development opportunities.
Technical Trainer conducting workshops and training sessions on MERKUR Group's product content for diverse audiences. Engaging with employees and clients to ensure smooth product operation and understanding.