Founding Staff Software Engineer supporting site reliability and infrastructure at Character.AI. Collaborating with the development team to ensure product reliability and scalability as the user base grows.
Responsibilities
Maintain production services and keep them operational.
Develop tools, instrumentation, and automation to monitor and optimize the performance and reliability of our service.
Develop, implement and maintain automation tools and processes to prevent and mitigate service disruptions.
Collaborate with development teams to design and implement scalable, reliable systems and CI/CD processes for deployment.
Establish and support SLAs and SLOs for our site.
Provide system monitoring and incident alerts.
Participate in on-call rotations to provide support for critical incidents and outages.
Develop plans for site reliability and disaster recovery.
Requirements
5+ years of experience in a development-focused DevOps/SRE role within a technology organization operating at significant scale
Deep experience with and proven success in developing software tools and automation wherever needed using Python and Golang
Expertise with SQL, Linux, CI/CD, Kubernetes, and Terraform to support a site/application within a large multi-node infrastructure and a growing user base
Experience working with multiple cloud computing platforms, such as GCP, is also a must
Demonstrated ability to troubleshoot technical issues successfully and reliably across a range of platforms and systems
Experience with incident management and event postmortems
Outstanding candidates will have one or more of the following:
Familiarity with GPU clusters and/or HPC environments
Experience with monitoring and logging tools such as Prometheus and Grafana
Hands-on experience scaling a consumer product from early days into hypergrowth
Benefits
🩺 Top-notch health coverage for you & your family, with the majority of the premium covered
💰 We invest in your future with a generous 401(k) contribution
🍼 New parents, we've got you covered with incredible paid leave, up to 20 weeks
🌴 4 weeks of PTO to explore, unwind & come back recharged
🍽️ Daily in-office catering plus a monthly DoorDash stipend to help keep you fueled no matter where you are
✨ Monthly wellness stipend to support you in your health journey
Manager of Mechanical Engineering ensuring high-availability mechanical systems in data centers. Collaborating on lifecycle management and performance evaluation across mission-critical facilities in a hybrid role.
Reliability Engineer ensuring operational readiness of data centers at Rowan Digital Infrastructure. Overseeing commissioning, operational standards, and transitioning facilities into live operations.
DevOps Engineer developing reusable Ansible and Puppet modules and managing CI/CD for project teams. Join PLATH in Hamburg, focusing on crisis detection software development.
Senior DevOps Engineer designing and maintaining CI/CD pipelines for a leading connectivity firm. Collaborating with cross-functional teams to optimize cloud infrastructure and enhance operational excellence.
Mechanical Reliability Engineer at Cargill ensuring asset reliability through advanced maintenance practices. Collaborating with teams and overseeing projects in heavy industrial processes.
Sr. DevOps Engineer at AllTrails focused on enhancing infrastructure reliability and security. Collaborating with engineering teams and contributing to projects for system optimization.
Senior IT Analyst focusing on SRE for Itaú, the largest bank in Latin America. Ensuring reliability and performance of critical systems through automation and incident resolution.
Site Reliability Engineer focusing on building scalable systems and maintaining high service uptime at Trade Nation. Collaborating with developers and product teams at a global trading firm.