Staff Site Reliability Engineer focusing on observability at CVS Health. Leading design and implementation of observability systems across distributed environments and edge computing.
Responsibilities
Lead the design, implementation, and optimization of observability systems
Collaborate with cross-functional teams to build robust monitoring, alerting, and telemetry solutions
Drive best practices, mentor others, and shape the strategic evolution of our observability ecosystem
Design and implement comprehensive observability solutions tailored for edge computing environments
Define and maintain Service Level Indicators (SLIs), Service Level Objectives (SLOs), and business KPIs
Build and optimize dashboards, visualizations, and alerting systems
Implement distributed tracing and log aggregation systems
Collaborate with engineering teams to ensure applications and infrastructure at edge locations are designed with observability in mind
Drive proactive identification of issues in edge facilities
Lead incident postmortems and implement observability-driven improvements
Develop and maintain tools, scripts, and automation to enhance observability pipelines
Evaluate and integrate industry-standard observability tools
Requirements
7+ years of experience in Site Reliability Engineering, Observability Engineering, or a related field
5+ years of experience with observability tools and platforms such as Prometheus, Grafana, Splunk, ELK, OpenTelemetry, or similar
3+ years of experience with microservices, containerized environments (e.g., Kubernetes, Docker), and distributed systems, particularly in edge deployments
Experience implementing AIOps
Strong proficiency in programming/scripting languages (e.g., Python, Java) for automation and tooling in distributed environments
Certifications in cloud platforms (Google Cloud Professional certification) or Kubernetes
Knowledge of incident management processes and tools (e.g., ServiceNow, xMatters, Opsgenie) tailored for distributed systems
Benefits
Affordable medical plan options
401(k) plan (including matching company contributions)
Employee stock purchase plan
No-cost programs including wellness screenings, tobacco cessation and weight management programs, confidential counseling and financial coaching
DevOps Engineer maintaining and automating infrastructure and CI/CD processes for cybersecurity solutions by NordLayer. Collaborating with teams to ensure performance and scalability of cloud services.
DevOps Engineer maintaining and improving infrastructure and CI/CD processes for a cybersecurity solutions provider. Collaborating with cross-functional teams for reliable and scalable cloud solutions.
DevOps Engineer maintaining and automating infrastructure and CI/CD processes at NordLayer. Collaborating with Senior Engineers to implement best practices in a dynamic cybersecurity environment.
Secure DevOps Engineer responsible for integrating security into CI/CD pipelines and strengthening AWS infrastructure. Key expertise in AWS security and container management.
DevOps Engineer responsible for CI/CD pipeline development and automation for urban software solutions. Collaborating with teams to enhance efficiency in software deployment and infrastructure.
DevOps Engineer managing cloud and on-premises platforms for a public sector infrastructure project. Collaboration primarily remote, with occasional on-site meetings.
DevSecOps Engineer architecting CI/CD framework services for Truist, enhancing the flow of business value through DevSecOps practices. Building and maintaining automation for software delivery and operations.
Application Security Manager at Evertec, handling security strategy and implementation in financial tech. Leading efforts in Application Security, DevSecOps, and compliance with financial regulations.
Databricks Senior DevOps Engineer designing and operating platforms on AWS and Databricks for Financial Crime. Focused on platform infrastructure, governance, security, and operations.
Site Reliability Engineer at Assecor, focusing on SLIs, SLOs, and incident management. Enhancing performance and reliability through observability and automation in a hybrid work environment.