Lead the SRE team, managing two engineers and the cloud infrastructure for Robin AI's Legal AI platform. Drive monitoring strategies for high availability and reliability of services while collaborating with the CTO.
Responsibilities
Lead and mentor a team of two Site Reliability Engineers, providing technical guidance and career development
Work closely with the CTO to define and implement the technical infrastructure roadmap
Establish monitoring strategies and implement solutions to enhance reliability, scalability, and cost-efficiency
Collaborate with development team leads to optimise build, test, and deployment processes
Lead incident response and establish processes for troubleshooting production issues
Organise and oversee on-call rotations to ensure 24/7 system reliability
Drive documentation standards and knowledge sharing within the engineering organisation
Requirements
5+ years of experience in DevOps or Site Reliability Engineering roles, with 2+ years in a managerial position
Proven experience managing and mentoring technical team members
Proficiency in at least one backend programming language (we use Python)
Strong knowledge of AWS services (ECS, S3, RDS, Lambda, etc.), managed with Terraform
Knowledge of observability frameworks and tools (we use OpenTelemetry, CloudWatch, and Datadog)
Excellent leadership, communication, and problem-solving skills
Experience with AI/ML infrastructure deployment and scaling
Benefits
Generous equity scheme - everyone gets to be an owner of Robin AI!
20 days PTO, in addition to the public holidays observed in South Africa.
Growth opportunities: We prioritise promotions for high performers and help you to progress your career.