Hybrid Lead Systems Engineer – HPC, AI

Posted 3 weeks ago

About the role

  • Lead Systems Engineer managing AI platform operations at an emerging AI infrastructure start-up. Oversee vendor collaboration, technical troubleshooting, and customer engagement to ensure optimal service delivery.

Responsibilities

  • Coordinate resolution of complex (L3) issues by escalating them to product/engineering teams, including vendors, and managing vendor responses
  • Monitor system health, alerts, and customer usage patterns
  • Document solutions, workarounds, and support procedures; create and maintain knowledge base content
  • Automate common tasks and fixes
  • Configure and integrate tooling to support optimal operation of the platform, and support tool selection
  • Assist customers with platform configuration, onboarding, and usage best practices
  • Collaborate with platform and infrastructure support/engineering teams to resolve platform integration issues
  • Ensure SLAs and customer satisfaction targets are met
  • Provide L1 support for customer-reported issues and requests
  • Provide L2 support by diagnosing, replicating, and troubleshooting issues across platform and infrastructure
  • Work with customers and multiple stakeholders to understand requirements and challenges, and provide reporting on usage, workflow, and billing

Requirements

  • Extensive experience in technical support, system engineering, or platform operations
  • Solid understanding of L1 and L2 support processes (ticketing, escalation, troubleshooting)
  • Familiarity with cloud-based platforms, APIs, and distributed systems
  • Understanding of AI/ML concepts and tooling (basics of model training, inference, and data pipelines)
  • Experience with monitoring/logging tools (e.g., Grafana, Kibana, Splunk)
  • Excellent communication skills to interface with customers as well as internal and vendor teams
  • Good understanding of the tooling requirements of ML engineers and data scientists, and of how to optimize their experience
  • System administration experience with operating systems such as RHEL/CentOS and Ubuntu, including Linux kernel tuning
  • Proficiency with Ansible, NVIDIA and CUDA toolkits, Kubernetes, and container orchestration
  • Understanding of automation, monitoring, and security for GPU-as-a-service platforms

Benefits

  • Health insurance
  • Professional development opportunities

Job title

Lead Systems Engineer – HPC, AI

Job type

Experience level

Senior

Salary

Not specified

Degree requirement

Bachelor's Degree

Location requirements
