Hybrid Capacity Operations Engineer

About the role

  • As a Capacity Operations Engineer at Baseten, you will secure and manage GPU clusters for AI workloads and lead initiatives that ensure 99.9% uptime across multi-cloud environments.

Responsibilities

  • Lead Specialized Pods: Act as the lead for specific GPU pods (e.g., H100 or B200), managing the full lifecycle of acquisition, air traffic control, and maintenance for those assets.
  • Advanced Orchestration: Execute complex workload migrations and "sticky" deployment drains, ensuring deployment scheduling rules meet strict regional and compliance requirements.
  • Build for Scalability: Design and implement the "next version" of Baseten’s capacity management system to handle a 10x increase in GPU volume.
  • Financial Modeling: Leverage your understanding of unit economics to build ROI models for GPU spend, ensuring Baseten scales profitably.
  • Cross-Team Collaboration: Partner with SRE, Infra, and FDE teams to take discrete operational tasks off their plate and verify "last mile" follow-through on infrastructure changes.
  • Incident Response: Lead capacity-crunch response by rapidly untainting and re-coordinating workloads during high-pressure outages.

Requirements

  • Bachelor's, Master's, or Ph.D. degree in Computer Science, Engineering, Mathematics, or a related field
  • 5+ years of professional work experience in a high-growth environment, preferably at a hyperscaler (GCP, AWS, Azure) or a specialized GPU provider
  • Deep expertise in Kubernetes, including hands-on experience with taints, cordons, node draining, and custom operators
  • Demonstrated experience with Go or Python in a production-level environment
  • Strong financial literacy and the ability to model complex trade-offs between capacity reliability and cost
  • High tenacity and a collaborative mindset

Benefits

  • Competitive compensation, including meaningful equity
  • 100% coverage of medical, dental, and vision insurance for employees and dependents
  • Generous PTO policy, including a company-wide Winter Break (our offices are closed from Christmas Eve to New Year's Day!)
  • Paid parental leave
  • Company-facilitated 401(k)
  • Exposure to a variety of ML startups, offering unparalleled learning and networking opportunities

Job title

Capacity Operations Engineer

Experience level

Mid level, Senior

Salary

$170,000 - $230,000 per year

Degree requirement

Bachelor's Degree
