Senior Manager of Site Reliability Engineering at Insulet overseeing SRE practices and team leadership to enhance system reliability. Driving automation, incident response, and partnership across engineering and product teams.
Responsibilities
Lead the execution and continuous improvement of SRE practices across assigned platforms and services, reinforcing a culture of reliability, efficiency, and operational ownership
Manage and evolve automation strategies that reduce operational toil, improve system reliability, and increase engineering productivity
Design, implement, and operate observability, monitoring, and alerting solutions that provide actionable insight into system health, availability, and performance
Own and lead high‑severity incident response for supported services, ensuring effective triage, coordination, root cause analysis, and completion of corrective and preventative actions
Analyze reliability, performance, and capacity metrics to identify risks, drive proactive improvements, and support long‑term system resilience
Partner with software engineering, product, and infrastructure teams to embed SRE principles throughout the development lifecycle and influence architecture and design decisions
Build, coach, and develop SRE managers and engineers, fostering technical excellence, career growth, and strong on‑call and operational practices
Support capacity planning, scalability assessments, and demand forecasting for critical systems and services
Ensure SRE processes, standards, and best practices are well documented, understood, and consistently applied
Requirements
Bachelor’s degree in Computer Science, Engineering, or a related field, or equivalent practical experience
12+ years of overall engineering experience, including 5+ years in Site Reliability Engineering, DevOps, or a similar role
3+ years of experience leading engineering teams or managing senior technical contributors
Strong experience with observability and monitoring platforms such as Datadog, Prometheus, Dynatrace, Grafana, ELK, or similar
Proficiency in at least one programming language such as Python, Go, or Java
Hands‑on experience with cloud platforms (AWS, Azure, or GCP) and with container and orchestration technologies (Docker, Kubernetes)
Solid working knowledge of AWS services such as VPC, EC2, ELB, ECS, EKS, Lambda, IAM, CloudWatch, S3, SQS, SNS, Route 53, and WAF
Experience with infrastructure‑as‑code tools such as Terraform, Ansible, or equivalents
Strong troubleshooting and problem‑solving skills in distributed systems environments
Working knowledge of security best practices and operational risk management
Experience with resilience testing, chaos engineering, or failure‑injection techniques
Senior Site Reliability Engineer focusing on reliability and operational excellence of workflow orchestration platforms like Apache Airflow. Engaging in operations and engineering projects in a hybrid setup.
Senior Site Reliability Engineer for observability platforms at Dimensional, ensuring reliability and scaling the infrastructure. Collaborating with teams on operations and engineering projects.
Senior Staff Reliability Engineer for the humanoid robotics team ensuring performance and safety standards. Leading reliability engineering initiatives and mentoring within the engineering team.
Reliability Engineer at Air Liquide optimizing maintenance strategies, ensuring equipment uptime across multiple sites in the United States. Collaborating with teams for continuous improvement and operational excellence.
Senior Azure Engineer at Capgemini responsible for building, operating, and optimizing cloud-native platforms. Collaborating with teams to ensure reliability, performance, and security for critical workloads.
DevOps Engineer specialized in Cloud environments at Avanquest, planning and migrating services to the Cloud and implementing microservice architectures.
Lead DevOps Engineer designing cloud infrastructure for ML/AI solutions in medical imaging. Collaborating across teams for scalable, secure platforms that optimize data operations.
DevOps/SRE Engineer for cloud environments developing ERP software at Scopevisio. Focus on AWS, infrastructure scaling, and modern technologies in a collaborative team.
Senior Coordinator for Infrastructure and DevOps leading technological infrastructure strategy and team development at RD Saúde. Ensuring stability, security, and cost efficiency in cloud operations.
Azure DevOps IT Engineer at iKnowHealth managing cloud and hybrid solutions with Microsoft Azure. Responsible for optimizing infrastructure and ensuring system performance in healthcare software.