Hybrid Senior Site Reliability Engineer – AI/ML Optimized GPU Clusters

Posted last month

About the role

  • Senior Site Reliability Engineer role at a company operating one of the largest GPU infrastructures. You will be responsible for ensuring the service's fault tolerance and for using cloud technology to build infrastructure solutions.

Responsibilities

  • Ensure fault tolerance, scalability, and uninterrupted operation of the service
  • Use cutting-edge cloud technology to solve a variety of infrastructure problems
  • Implement and improve CI/CD processes

Requirements

  • Solid experience with programming languages such as Go, Python, or C++
  • Experience with multi-node environments running a large number of GPUs
  • Good understanding of classic algorithms and data structures
  • Commercial experience with, and deep understanding of, Unix/Linux systems and network technology
  • Solid experience with CI/CD and infrastructure as code (IaC)
  • Experience with containerization and configuration management (Ansible, Salt, Terraform, Docker, Kubernetes, Helm)

Benefits

  • Competitive salary and comprehensive benefits package
  • Opportunities for professional growth
  • Flexible working arrangements
  • Dynamic and collaborative work environment

Experience level

Senior

Salary

Not specified

Degree requirement

No Education Requirement
