Lead Site Reliability Engineer building cloud-agnostic, highly-available infrastructure and leading SRE team at Mistral AI.
Responsibilities
Lead Site Reliability Engineer responsible for driving the infrastructure team and reporting to the Head of Engineering.
Empower and supervise the SRE team: remove obstacles, hire, onboard, and elevate team performance; own project planning and task allocation.
Collaborate with stakeholders across engineering, science, and product management.
Design, build, and maintain scalable, highly available, fault-tolerant infrastructure for web services and ML workloads.
Ensure platform, inference, and model training environments are highly available and reproducible across HPC clusters.
Operate production systems: troubleshooting, on-call responses, user admin, data extraction, infrastructure scaling; perform root cause analyses.
Implement and improve monitoring, alerting, and incident response systems to minimize downtime.
Implement and maintain CI/CD, containerization, orchestration, monitoring, logging and alerting workflows for client-facing APIs and large training runs.
Drive continuous improvement in infrastructure automation, deployment, and orchestration using tools like Kubernetes, Flux, Terraform.
Collaborate with AI/ML researchers to enable safe and reproducible model-training experiments and build a cloud-agnostic platform abstraction layer.
Design and develop workflows, tooling, APIs, dashboards and automation to improve reliability and performance.
Collaborate with security to ensure best practices and compliance; document processes and contribute to open-source and publications.
Requirements
10+ years of experience in a DevOps/SRE role.
Experience with building and leading high-performing teams.
Experience with cloud computing and highly available distributed systems.
Exposure to site reliability issues in critical environments (issue root cause analysis, in-production troubleshooting, on-call rotations).
Experience working against reliability KPIs (observability, alerting, SLAs).
Hands-on experience with CI/CD, containerization and orchestration tools (Docker, Kubernetes, Flux).
Experience with monitoring, logging and observability tools (Prometheus, Grafana, ELK Stack, Datadog).
Experience with infrastructure-as-code tools (Terraform, CloudFormation).
Proficiency in scripting languages (Python, Go, Bash).
Understanding of networking, security, and system administration concepts.
Excellent problem-solving and communication skills.
Self-motivated and able to work well in a fast-paced startup environment.
Willingness to reside in or relocate to Paris or London (candidates based in France or the UK may be considered for remote work but must visit the office during onboarding and monthly thereafter).
Benefits
💰 Competitive salary and equity
🧑‍⚕️ Health insurance
🚴 Transportation allowance
🥎 Sport allowance
🥕 Meal vouchers
💰 Private pension plan
🍼 Generous parental leave policy
🌎 Visa sponsorship
Accommodation and travel covered for the first month of onboarding
Requirement to visit local office at least 3 days per month (after onboarding)
Infrastructure DevOps Engineer at Bull, strengthening the AdminLab team in Echirolles. Work on Linux infrastructure and automation practices in an HPC environment.
Product Quality & Reliability Engineer developing quality/reliability standards for Applied Materials. Design methods for testing products and analyze operational data in a supportive team environment.
DevOps System Engineer creating and managing infrastructure for ESET's global SaaS service. Collaborating with tech teams to maintain secure and stable operations.
Provides expertise in business applications design and functionality. Supports users and validates technical designs for alignment with business needs.
Senior Site Reliability Engineer supporting the reliability and performance of Broadridge’s fintech platform. Collaborating with senior engineers on automation, infrastructure, and production stability.
DevOps Engineer at Mindera focusing on Windows environments and Azure cloud solutions. Involves system modernization, automation, and migration projects with collaborative teams.
DevSecOps Lead supporting Synthesized's cloud automation strategy with a focus on security and compliance. Collaborating closely with development teams to shape cloud architecture and enhance deployment processes.
DevOps Engineer managing technical implementation and operational maintenance for Consort Group's ecosystem. Collaborating in project phases and optimizing processes in a hybrid work environment.
DevOps Engineer at AddSecure designing and developing modern cloud infrastructure. Involved with IoT solutions and scaling services using AWS, Azure, and Terraform.
Engineer responsible for designing and maintaining SCM, CI/CD, and Software Delivery processes for an international engineering services company. Collaborate in a hybrid environment with advanced technology projects.