Software Engineer building and operating compute infrastructure powering OpenAI’s AI research. Optimizing Kubernetes clusters and ensuring reliability in supercomputing environments for advanced AI workloads.
Responsibilities
Spin up and scale large Kubernetes clusters, including automation for provisioning, bootstrapping, and cluster lifecycle management
Build software abstractions that unify multiple clusters and present a seamless interface to training workloads
Own node bring-up from bare metal through firmware upgrades, ensuring fast, repeatable deployment at massive scale
Improve operational metrics, for example by cutting cluster restart times (e.g., from hours to minutes) and shortening firmware or OS upgrade cycles
Integrate networking and hardware health systems to deliver end-to-end reliability across servers, switches, and data center infrastructure
Develop monitoring and observability systems to detect issues early and keep clusters stable under extreme load
Requirements
Experience as an infrastructure, systems, or distributed systems engineer in large-scale or high-availability environments
Strong knowledge of Kubernetes internals, cluster scaling patterns, and containerized workloads
Proficiency in infrastructure concepts (compute, networking, storage, security) and in automating cluster or data center operations
Bonus: background with GPU workloads, firmware management, or high-performance computing
Benefits
Medical, dental, and vision insurance for you and your family, with employer contributions to Health Savings Accounts
Pre-tax accounts for Health FSA, Dependent Care FSA, and commuter expenses (parking and transit)
401(k) retirement plan with employer match
Paid parental leave (up to 24 weeks for birth parents and 20 weeks for non-birthing parents), plus paid medical and caregiver leave (up to 8 weeks)
Paid time off: flexible PTO for exempt employees and up to 15 days annually for non-exempt employees
13+ paid company holidays, plus multiple coordinated paid office closures throughout the year for focus and recharge
Paid sick or safe time (1 hour per 30 hours worked, or more, as required by applicable state or local law)
Mental health and wellness support
Employer-paid basic life and disability coverage
Annual learning and development stipend to fuel your professional growth
Daily meals in our offices, and meal delivery credits as eligible
Relocation support for eligible employees
Additional taxable fringe benefits, such as charitable donation matching and wellness stipends, may also be provided