Lead the SRE team that designs and operates scalable, secure cloud infrastructure for Instabase's AI platform. Manage CI/CD, Kubernetes, production reliability, and release processes.
Responsibilities
Define and steer the technical direction for the team, collaborating with cross-functional partners
Develop and execute comprehensive short- and long-term roadmaps balancing business needs, user experience, and technical foundations
Oversee cloud infrastructure and deployment automation to ensure efficient and reliable operations
Guarantee uptime and reliability for production systems through proactive monitoring and production support
Manage vulnerability assessments and facilitate prompt remediation
Maintain and enhance CI/CD and build infrastructure to support development workflows
Implement and optimize tools to enhance developer productivity
Drive improvements in release management processes and tooling to ensure smooth, reliable software delivery
Build scalable, distributed, and fault-tolerant systems integrating Software and Systems Engineering to drive performance, capacity, and reliability
Requirements
5+ years of experience in Site Reliability Engineering, Software Engineering, or Production Engineering
Bachelor’s or Master’s degree in Computer Science, Engineering, or a related field, or equivalent practical experience
Proven track record of setting technical and cultural standards for engineering teams
Demonstrated experience in managing and sustaining SaaS production environments
Hands-on experience with major cloud providers such as AWS and Azure
Proficient in containerization technologies like Docker
Expertise in container orchestration platforms, especially Kubernetes
Skilled in overseeing and managing software release processes to ensure smooth deployments
Systematic approach to solving platform and production issues, strong problem-solving abilities, and a passion for automation
Benefits
Bonus
Equity
Benefits
Hybrid work
Offices in San Francisco, New York, London and Bengaluru
Site Reliability Engineer enhancing platform reliability for AI workflows at WRITER. Overseeing automated solutions and cloud infrastructure supporting high-traffic AI systems.
Site Reliability Engineer ensuring 24/7 availability of AI-powered workflows at WRITER. Developing and automating robust platforms for high-traffic AI demands.
Site Reliability Engineer maintaining cloud infrastructure for Tricentis SaaS Products. Collaborating closely with engineers, focusing on observability and performance.
DevOps Engineer at DATAGROUP in Rostock managing IT applications and cloud technologies. Collaborating with teams to support client IT transformations in a flexible work environment.
SRE Technical Manager leading reliability engineering teams ensuring performance for Navy IT services. Manage teams, collaborate on automation, and drive continuous improvement in a critical systems environment.
DevOps Engineer responsible for optimizing and securing cloud deployment processes at Axi. Collaborating across technology teams to promote best practices in DevOps methodologies.
Azure Cloud Engineer ensuring a safe and scalable cloud environment at Schoologica while contributing to innovative educational solutions with modern cloud technologies.
DevSecOps Engineer responsible for enhancing Thales' secure hosting platforms in public and private clouds. Collaborating with teams to apply modern practices and build resilient infrastructures.