Engineering Manager leading Support Engineers to enhance observability and operational practices in AI production environments. Overseeing runtime debugging and incident resolution while fostering a customer-first mindset.
Responsibilities
Lead, mentor, and scale a team of Support Engineers specializing in AI and ML production environments, fostering technical depth, accountability, and a customer-first mindset.
Serve as a player-coach, directly contributing to complex troubleshooting, inference optimization, and incident resolution for high-value enterprise customers.
Diagnose and resolve runtime issues impacting model performance, such as latency spikes, memory pressure, GPU scheduling problems, and concurrency bottlenecks.
Debug Kubernetes infrastructure (pods, controllers, networking) and observability stacks using tools like Grafana, Loki, and Prometheus.
Own critical incidents end-to-end — coordinating across Engineering, Product, and Sales to ensure timely resolution, transparent communication, and SLA compliance.
Drive continuous improvement by enhancing diagnostic runbooks, refining alerting strategies, and developing internal automation for faster root-cause analysis.
Collaborate with product and platform teams to surface insights from production issues — shaping roadmap priorities around reliability, inference efficiency, and operational scalability.
Lead initiatives that enhance observability, monitoring, and alerting for AI workloads across distributed compute environments.
Balance tactical execution with strategic vision, ensuring your team not only resolves today’s issues but also builds systems that prevent tomorrow’s.
Requirements
Proven experience leading or mentoring technical teams in Support Engineering, Infrastructure, or Site Reliability within production AI/ML or distributed systems environments.
Deep Kubernetes troubleshooting expertise, including advanced resource debugging, runtime performance analysis, and observability-driven diagnostics.
Hands-on experience managing distributed systems or AI products at scale — optimizing GPU/CPU utilization, batch sizing, concurrency, and memory efficiency.
Expertise with observability and monitoring tools (Grafana, Prometheus, Loki) and alerting best practices.
Skilled in incident management and customer escalation handling, with a proven ability to drive clarity and confidence in high-stakes situations.
Demonstrated project management and organizational skills, capable of orchestrating multi-stakeholder efforts from incident triage through resolution and RCA.
Benefits
Competitive compensation, including meaningful equity.
100% coverage of medical, dental, and vision insurance for employees and their dependents.
Generous PTO policy, including a company-wide Winter Break (our offices are closed from Christmas Eve to New Year's Day!).
Paid parental leave.
Company-facilitated 401(k).
Exposure to a variety of ML startups, offering unparalleled learning and networking opportunities.
Job title
Engineering Manager, Support – Customer Engineering