Principal Systems At-Scale Engineer deploying strategies to improve large-scale data center clusters. Collaborating with visionary professionals to optimize systems in AI and GPU computing.
Responsibilities
Deploy strategies to collect and analyze debugging and anomaly signals from large fleets of clusters to improve quality and experience.
Build and expand debugging tools to identify, diagnose, and recover out-of-service systems, growing customer-available capacity.
Author and deploy "fault signatures" and automated recovery rules.
Lead cross-team task forces to address undefined failure modes in high-value AI/GPU systems, cutting backlogs through data-driven isolation.
Leverage AI, analytics, and efficiency tools to scale debug efforts, turning manual triage into productized, automated code.
Act as a technical leader and cultural anchor.
Mentor junior and senior engineers.
Encourage organizational health initiatives.
Promote innovation through hackathons and sharing sessions.
Requirements
15+ years of experience debugging systems at scale, including components of large fleets.
BS/MS in Computer Science or a related field (or equivalent experience).
Proven understanding of performance clusters, infrastructure, and workload patterns.
Knowledge and experience with telemetry and at-scale analytics for large platforms.
Experience using and installing fleets of Linux-based server platforms.
Collaboration Engineer implementing and supporting collaboration systems for Encova Insurance. Involves managing Microsoft Teams, Cisco systems, and project tasks in a hybrid work environment.
Senior Transient Analysis Engineer responsible for hydraulic analysis in EPC projects. Leading technical roles on transient phenomena and collaborating with multi-disciplinary teams.
Junior Signal Processing Engineer supporting Electronic Warfare programs at Leidos. Developing algorithms and analyzing data to enhance sensor efficacy in complicated environments.
Software Engineer developing a computer exploitation framework for cybersecurity at Kudu Dynamics. Collaborating with vulnerability researchers to build reliable exploit chains in constrained environments.
RAMT Logistic Engineer developing integration and system analysis requirements in the aerospace sector. Collaborating on reliability and maintainability aspects of aerospace technology projects.
Remote Sensing Engineer developing and verifying cloud products derived from satellite data at EUMETSAT. Contributing to MTG and EPS-SG missions with scientific analysis and operational processing.
County Engineer for Southern Water managing sewerage projects. Responsible for investment planning and problem-solving in the sewerage system for effective service delivery.
Maintenance Engineer ensuring reliable operation of mechanical systems at Rain Carbon. Overseeing maintenance tasks and troubleshooting equipment issues in an industrial environment.
Maintenance Engineer ensuring reliable operation of mechanical equipment at Rain Carbon Inc. Supervising maintenance tasks, troubleshooting, and coordinating improvements in operational efficiency.
Optomechanical Sr Engineer leading the design and analysis of high-performance imaging hardware at Satellogic. Working on telescopes and optical elements assemblies with a team of engineers.