Founding Staff Software Engineer supporting site reliability and infrastructure at Character.AI. Collaborating with the development team to ensure product reliability and scalability as the user base grows.
Responsibilities
Maintain production services and keep them operational.
Develop tools, instrumentation, and automation to monitor and optimize the performance and reliability of our service.
Develop, implement and maintain automation tools and processes to prevent and mitigate service disruptions.
Collaborate with development teams to design and implement scalable, reliable systems and CI/CD processes for deployment.
Establish and support SLAs and SLOs for our site.
Provide system monitoring and incident alerts.
Participate in on-call rotations to provide support for critical incidents and outages.
Develop plans for site reliability and disaster recovery.
Requirements
5+ years of experience in a development-focused DevOps/SRE role within a technology organization operating at significant scale
Deep experience with and proven success in developing software tools and automation wherever needed using Python and Golang
Expertise with SQL, Linux, CI/CD, Kubernetes, and Terraform to support a site/application within a large multi-node infrastructure and a growing user base
Experience working with multiple cloud computing platforms, such as GCP, is also a must
Demonstrated ability to troubleshoot technical issues successfully and reliably across a range of platforms and systems
Experience with incident management and event postmortems
Outstanding candidates will have one or more of the following:
Familiarity with GPU clusters and/or HPC environments
Experience with monitoring and logging tools such as Prometheus and Grafana
Hands-on experience scaling a consumer product from early days into hypergrowth
Benefits
🩺 Top-notch health coverage for you & your family, with the majority of the premium covered
💰 We invest in your future with a generous 401(k) contribution
🍼 New parents, we've got you covered with incredible paid leave, up to 20 weeks
🌴 4 weeks of PTO to explore, unwind & come back recharged
🍽️ Daily in-office catering plus a monthly DoorDash stipend to help keep you fueled no matter where you are
✨ Monthly wellness stipend to support you in your health journey