Capacity Ops Engineer securing and managing GPU clusters for AI workloads at Baseten. Leading initiatives that ensure 99.9% uptime across multi-cloud environments.
Responsibilities
Lead Specialized Pods: Act as the lead for specific GPU pods (e.g., H100 or B200), managing the full lifecycle of acquisition, air traffic control, and maintenance for those assets.
Advanced Orchestration: Execute complex workload migrations and "sticky" deployment drains, ensuring deployment scheduling rules meet strict regional and compliance requirements.
Build for Scalability: Design and implement the "next version" of Baseten’s capacity management system to handle a 10x increase in GPU volume.
Financial Modeling: Leverage your understanding of unit economics to build ROI models for GPU spend, ensuring Baseten scales profitably.
Cross-Team Collaboration: Partner with SRE, Infra, and FDE teams to take discrete operational tasks off their plate and verify "last mile" follow-through on infrastructure changes.
Incident Response: Lead capacity-crunch response by rapidly untainting and re-coordinating workloads during high-pressure outages.
Requirements
Bachelor's, Master's, or Ph.D. degree in Computer Science, Engineering, Mathematics, or a related field
5+ years of professional work experience in a high-growth environment, preferably at a hyperscaler (GCP, AWS, Azure) or a specialized GPU provider
Deep expertise in Kubernetes, including hands-on experience with taints, cordons, node draining, and custom operators
Demonstrated experience with Go or Python in a production-level environment
Strong financial literacy and the ability to model complex trade-offs between capacity reliability and cost
High tenacity and collaborative mindset
Benefits
Competitive compensation, including meaningful equity.
100% coverage of medical, dental, and vision insurance for employees and dependents
Generous PTO policy, including a company-wide Winter Break (our offices are closed from Christmas Eve to New Year's Day!)
Paid parental leave
Company-facilitated 401(k)
Exposure to a variety of ML startups, offering unparalleled learning and networking opportunities.