Engineering Manager leading Support Engineers to enhance observability and operational practices in AI production environments. Overseeing runtime debugging and incident resolution while fostering a customer-first mindset.
Responsibilities
Lead, mentor, and scale a team of Support Engineers specializing in AI and ML production environments, fostering technical depth, accountability, and a customer-first mindset.
Serve as a player-coach, directly contributing to complex troubleshooting, inference optimization, and incident resolution for high-value enterprise customers.
Diagnose and resolve runtime issues impacting model performance, such as latency spikes, memory pressure, GPU scheduling contention, and concurrency bottlenecks.
Debug Kubernetes infrastructure (pods, controllers, networking) and observability stacks using tools like Grafana, Loki, and Prometheus.
Own critical incidents end-to-end — coordinating across Engineering, Product, and Sales to ensure timely resolution, transparent communication, and SLA compliance.
Drive continuous improvement by enhancing diagnostic runbooks, refining alerting strategies, and developing internal automation for faster root-cause analysis.
Collaborate with product and platform teams to surface insights from production issues — shaping roadmap priorities around reliability, inference efficiency, and operational scalability.
Lead initiatives that enhance observability, monitoring, and alerting for AI workloads across distributed compute environments.
Balance tactical execution with strategic vision, ensuring your team not only resolves today’s issues but also builds systems that prevent tomorrow’s.
Requirements
Proven experience leading or mentoring technical teams in Support Engineering, Infrastructure, or Site Reliability within production AI/ML or distributed systems environments.
Deep Kubernetes troubleshooting expertise, including advanced resource debugging, runtime performance analysis, and observability-driven diagnostics.
Hands-on experience managing distributed systems or AI products at scale — optimizing GPU/CPU utilization, batch sizing, concurrency, and memory efficiency.
Expertise with observability and monitoring tools (Grafana, Prometheus, Loki) and alerting best practices.
Skilled in incident management and customer escalation handling, with a proven ability to drive clarity and confidence in high-stakes situations.
Demonstrated project management and organizational skills, capable of orchestrating multi-stakeholder efforts from incident triage through resolution and RCA.
Benefits
Competitive compensation, including meaningful equity.
100% coverage of medical, dental, and vision insurance for employees and dependents.
Generous PTO policy, including a company-wide Winter Break (our offices are closed from Christmas Eve to New Year's Day!)
Paid parental leave.
Company-facilitated 401(k).
Exposure to a variety of ML startups, offering unparalleled learning and networking opportunities.
Job title
Engineering Manager, Support – Customer Engineering