Lead Site Reliability Engineer building cloud-agnostic, highly available infrastructure and leading the SRE team at Mistral AI.
Responsibilities
Drive the infrastructure team as Lead Site Reliability Engineer, reporting to the Head of Engineering.
Empower and supervise the SRE team: remove obstacles, hire, onboard, and elevate team performance; handle project planning and task allocation.
Collaborate with stakeholders across engineering, science, and product management.
Design, build, and maintain scalable, highly available, fault-tolerant infrastructure for web services and ML workloads.
Ensure platform, inference, and model training environments are highly available and reproducible across HPC clusters.
Operate production systems: troubleshooting, on-call response, user administration, data extraction, and infrastructure scaling; perform root cause analyses.
Implement and improve monitoring, alerting, and incident response systems to minimize downtime.
Implement and maintain CI/CD, containerization, orchestration, monitoring, logging, and alerting workflows for client-facing APIs and large training runs.
Drive continuous improvement in infrastructure automation, deployment, and orchestration using tools such as Kubernetes, Flux, and Terraform.
Collaborate with AI/ML researchers to enable safe and reproducible model-training experiments and build a cloud-agnostic platform abstraction layer.
Design and develop workflows, tooling, APIs, dashboards and automation to improve reliability and performance.
Collaborate with security to ensure best practices and compliance; document processes and contribute to open-source and publications.
Requirements
10+ years of experience in a DevOps/SRE role.
Experience with building and leading high-performing teams.
Experience with cloud computing and highly available distributed systems.
Exposure to site reliability issues in critical environments (issue root cause analysis, in-production troubleshooting, on-call rotations).
Experience working against reliability KPIs (observability, alerting, SLAs).
Hands-on experience with CI/CD, containerization and orchestration tools (Docker, Kubernetes, Flux).
Experience with monitoring, logging and observability tools (Prometheus, Grafana, ELK Stack, Datadog).
Experience with infrastructure-as-code tools (Terraform, CloudFormation).
Proficiency in scripting languages (Python, Go, Bash).
Understanding of networking, security, and system administration concepts.
Excellent problem-solving and communication skills.
Self-motivated and able to work well in a fast-paced startup environment.
Willingness to reside in or relocate to Paris or London (candidates elsewhere in France and the UK may be considered for remote work but must visit the office during onboarding and monthly thereafter).
Benefits
💰 Competitive salary and equity
🧑‍⚕️ Health insurance
🚴 Transportation allowance
🥎 Sport allowance
🥕 Meal vouchers
💰 Private pension plan
🍼 Generous parental leave policy
🌎 Visa sponsorship
Accommodation and travel covered for the first month of onboarding
Requirement to visit the local office at least 3 days per month (after onboarding)