Hybrid Site Reliability Engineer, Technical Lead

Posted 3 months ago

About the role

  • Lead the SRE team at Mistral AI and build cloud-agnostic, highly available infrastructure.

Responsibilities

  • Drive the infrastructure team as Lead Site Reliability Engineer, reporting to the Head of Engineering.
  • Empower and supervise the SRE team: remove obstacles, hire, onboard, and elevate team performance; plan projects and allocate tasks.
  • Collaborate with stakeholders across engineering, science, and product management.
  • Design, build, and maintain scalable, highly available, fault-tolerant infrastructures for web services and ML workloads.
  • Ensure platform, inference, and model training environments are highly available and reproducible across HPC clusters.
  • Operate production systems: troubleshooting, on-call response, user administration, data extraction, and infrastructure scaling; perform root cause analyses.
  • Implement and improve monitoring, alerting, and incident response systems to minimize downtime.
  • Implement and maintain CI/CD, containerization, orchestration, monitoring, logging, and alerting workflows for client-facing APIs and large training runs.
  • Drive continuous improvement in infrastructure automation, deployment, and orchestration using tools such as Kubernetes, Flux, and Terraform.
  • Collaborate with AI/ML researchers to enable safe and reproducible model-training experiments and build a cloud-agnostic platform abstraction layer.
  • Design and develop workflows, tooling, APIs, dashboards and automation to improve reliability and performance.
  • Collaborate with security to ensure best practices and compliance; document processes and contribute to open-source and publications.

Requirements

  • 10+ years of experience in a DevOps/SRE role.
  • Experience with building and leading high-performing teams.
  • Experience with cloud computing and highly available distributed systems.
  • Exposure to site reliability issues in critical environments (issue root cause analysis, in-production troubleshooting, on-call rotations).
  • Experience working against reliability KPIs (observability, alerting, SLAs).
  • Hands-on experience with CI/CD, containerization and orchestration tools (Docker, Kubernetes, Flux).
  • Experience with monitoring, logging and observability tools (Prometheus, Grafana, ELK Stack, Datadog).
  • Experience with infrastructure-as-code tools (Terraform, CloudFormation).
  • Proficiency in scripting languages (Python, Go, Bash).
  • Understanding of networking, security, and system administration concepts.
  • Excellent problem-solving and communication skills.
  • Self-motivated and able to work well in a fast-paced startup environment.
  • Willingness to reside in or relocate to Paris or London (candidates based in France or the UK may be considered for remote work but must visit the office during onboarding and monthly thereafter).

Benefits

  • 💰 Competitive salary and equity
  • 🧑‍⚕️ Health insurance
  • 🚴 Transportation allowance
  • 🥎 Sport allowance
  • 🥕 Meal vouchers
  • 💰 Private pension plan
  • 🍼 Generous parental leave policy
  • 🌎 Visa sponsorship
  • Accommodation and travel covered for the first month of onboarding
  • Requirement to visit the local office at least 3 days per month (after onboarding)

Job title

Site Reliability Engineer, Technical Lead

Job type

Experience level

Senior

Salary

Not specified

Degree requirement

No Education Requirement

Location requirements
