Engineering Manager leading Support Engineers to enhance observability and operational practices in AI production environments. Oversees runtime debugging and incident resolution while fostering a customer-first mindset.
Responsibilities
Lead, mentor, and scale a team of Support Engineers specializing in AI and ML production environments, fostering technical depth, accountability, and a customer-first mindset.
Serve as a player-coach, directly contributing to complex troubleshooting, inference optimization, and incident resolution for high-value enterprise customers.
Diagnose and resolve runtime issues impacting model performance, such as latency spikes, memory pressure, GPU scheduling problems, and concurrency bottlenecks.
Debug Kubernetes infrastructure (pods, controllers, networking) and observability stacks using tools like Grafana, Loki, and Prometheus.
Own critical incidents end-to-end — coordinating across Engineering, Product, and Sales to ensure timely resolution, transparent communication, and SLA compliance.
Drive continuous improvement by enhancing diagnostic runbooks, refining alerting strategies, and developing internal automation for faster root-cause analysis.
Collaborate with product and platform teams to surface insights from production issues — shaping roadmap priorities around reliability, inference efficiency, and operational scalability.
Lead initiatives that enhance observability, monitoring, and alerting for AI workloads across distributed compute environments.
Balance tactical execution with strategic vision, ensuring your team not only resolves today’s issues but also builds systems that prevent tomorrow’s.
Requirements
Proven experience leading or mentoring technical teams in Support Engineering, Infrastructure, or Site Reliability within production AI/ML or distributed systems environments.
Deep Kubernetes troubleshooting expertise, including advanced resource debugging, runtime performance analysis, and observability-driven diagnostics.
Hands-on experience managing distributed systems or AI products at scale — optimizing GPU/CPU utilization, batch sizing, concurrency, and memory efficiency.
Expertise with observability and monitoring tools (Grafana, Prometheus, Loki) and alerting best practices.
Skilled in incident management and customer escalation handling, with a proven ability to drive clarity and confidence in high-stakes situations.
Demonstrated project management and organizational skills, capable of orchestrating multi-stakeholder efforts from incident triage through resolution and RCA.
Benefits
Competitive compensation, including meaningful equity.
100% coverage of medical, dental, and vision insurance for employees and dependents.
Generous PTO policy, including a company-wide Winter Break (our offices are closed from Christmas Eve to New Year's Day!).
Paid parental leave.
Company-facilitated 401(k).
Exposure to a variety of ML startups, offering unparalleled learning and networking opportunities.
Job title
Engineering Manager, Support – Customer Engineering