SRE Team Lead managing two engineers and the cloud infrastructure for Robin AI's Legal AI platform. Drive monitoring strategies for high availability and reliability of services while collaborating with the CTO.
Responsibilities
Lead and mentor a team of two Site Reliability Engineers, providing technical guidance and career development
Work closely with the CTO to define and implement the technical infrastructure roadmap
Establish monitoring strategies and implement solutions to enhance reliability, scalability, and cost-efficiency
Collaborate with development team leads to optimise build, test, and deployment processes
Lead incident response and establish processes for troubleshooting production issues
Organise and oversee on-call rotations to ensure 24/7 system reliability
Drive documentation standards and knowledge sharing within the engineering organisation
Requirements
5+ years of experience in DevOps or Site Reliability Engineering roles, with 2+ years in a managerial position
Proven experience managing and mentoring technical team members
Proficiency in at least one backend programming language (we use Python)
Strong knowledge of AWS services (ECS, S3, RDS, Lambda, etc.) managed with Terraform
Knowledge of observability frameworks and tools (we use OpenTelemetry, CloudWatch, and Datadog)
Excellent leadership, communication, and problem-solving skills
Experience with AI/ML infrastructure deployment and scaling
Benefits
Generous equity scheme - everyone gets to be an owner of Robin AI!
20 days PTO, in addition to the public holidays observed in South Africa.
Growth opportunities: We prioritise promotions for high performers and help you to progress your career.