Lead the SRE team, managing two engineers and the cloud infrastructure for Robin AI's Legal AI platform. Drive monitoring strategies for high availability and reliability of services while collaborating with the CTO.
Responsibilities
Lead and mentor a team of two SRE Engineers, providing technical guidance and career development
Work closely with the CTO to define and implement the technical infrastructure roadmap
Establish monitoring strategies and implement solutions to enhance reliability, scalability, and cost-efficiency
Collaborate with development team leads to optimise build, test, and deployment processes
Lead incident response and establish processes for troubleshooting production issues
Organise and oversee on-call rotations to ensure 24/7 system reliability
Drive documentation standards and knowledge sharing within the engineering organisation
Requirements
5+ years of experience in DevOps or Site Reliability Engineering roles, with 2+ years in a managerial position
Proven experience managing and mentoring technical team members
Proficiency in at least one backend programming language (we use Python)
Strong knowledge of AWS services (ECS, S3, RDS, Lambda, etc.), managed with Terraform
Knowledge of observability frameworks and tools (we use OpenTelemetry, CloudWatch, and Datadog)
Excellent leadership, communication, and problem-solving skills
Experience with AI/ML infrastructure deployment and scaling
Benefits
Generous equity scheme - everyone gets to be an owner of Robin AI!
20 days PTO, in addition to the public holidays observed in South Africa.
Growth opportunities: We prioritise promotions for high performers and help you to progress your career.
Site Reliability Engineer improving reliability and availability of Forcepoint products through automation and operational efficiency. Engaging in incident response and collaborating with development teams.
DevOps Engineer responsible for internal tooling and API development to enhance deployment and operational efficiency at Genesys Cloud. Build automation to improve service health and scalability.
Site Reliability Engineer focused on designing and maintaining observability solutions for a fintech company. Collaborating across teams and automating infrastructure for global payment processing.
Azure Security Engineer working on cloud-based security strategies and implementations for Global Payments. Collaborating with teams to enhance the security posture and mitigate risks.
Release Engineer at Air Apps responsible for optimizing release processes and collaborating with cross-functional teams. Focused on smooth, reliable, and efficient application delivery.
DevOps Engineer responsible for maintaining and optimizing infrastructure at Tenet3. Focused on security, automation, and technical operations within a collaborative team environment.
Site Reliability Engineer II at LexisNexis Risk Solutions building Terraform modules and CI/CD pipelines. Responsible for developing cloud infrastructure and ensuring reliability, security, and observability.
DevOps Engineer supporting cloud modernization for the Department of the Air Force on the Cloud One contract. Involved in systems analysis, security practices, and collaboration with engineering teams.
Journeyman Cloud Operations Engineer maintaining cloud infrastructure across DoD organizations. Supporting DevSecOps and ensuring compliance with security requirements in a high-visibility program.
DevOps Engineer managing cloud-native platforms for Capgemini. Collaborating with development, data/ML, and security teams to deliver scalable solutions on Azure.