Lead the design, implementation, and optimization of observability systems
Collaborate with cross-functional teams to build robust monitoring, alerting, and telemetry solutions
Drive best practices, mentor others, and shape the strategic evolution of our observability ecosystem
Design and implement comprehensive observability solutions tailored for edge computing environments
Define and maintain Service Level Indicators (SLIs), Service Level Objectives (SLOs), and business KPIs
Build and optimize dashboards, visualizations, and alerting systems
Implement distributed tracing and log aggregation systems
Collaborate with engineering teams to ensure applications and infrastructure at edge locations are designed with observability in mind
Drive proactive identification of issues in edge facilities
Lead incident postmortems and implement observability-driven improvements
Develop and maintain tools, scripts, and automation to enhance observability pipelines
Evaluate and integrate industry-standard observability tools
Requirements
7+ years of experience in Site Reliability Engineering, Observability Engineering, or a related field
5+ years of experience with observability tools and platforms such as Prometheus, Grafana, Splunk, ELK, OpenTelemetry, or similar
3+ years of experience with microservices, containerized environments (e.g., Kubernetes, Docker), and distributed systems, particularly in edge deployments
Experience implementing AIOps
Strong proficiency in programming/scripting languages (e.g., Python, Java) for automation and tooling in distributed environments
Certifications in cloud platforms (e.g., Google Cloud Professional certification) or Kubernetes
Knowledge of incident management processes and tools (e.g., ServiceNow, xMatters, Opsgenie) tailored for distributed systems
Benefits
Affordable medical plan options
401(k) plan (including matching company contributions)
Employee stock purchase plan
No-cost programs including wellness screenings, tobacco cessation and weight management programs, confidential counseling and financial coaching
Join the Operations Team as a Senior Site Reliability Engineer driving operational excellence for cybersecurity solutions. Collaborate across teams to manage production platforms and optimize infrastructure.
Software Developer - DevOps System Administrator working within the SCMT team to enhance software application efficiency. Collaborating on tools and scripts for application lifecycle management.
DevOps Engineer managing CI/CD pipelines and Kubernetes deployments at Stefanini. Collaborating with teams to optimize application health and deployment processes.
DevOps Engineer working with development teams for seamless feature integration and deployment automation. Focus on CI/CD pipelines, monitoring solutions, and continuous process optimization.
DevOps Network Administrator managing AWS cloud environments for mission-critical systems. Ensuring performance, scalability, and security while integrating modern DevOps and cybersecurity practices.
DevOps Engineer at citema systems involved in developing IT platforms and integrating DevOps practices. Collaborating in a team to implement CI/CD pipelines and evaluate strategies.
Senior SRE Engineer managing cloud infrastructure and driving Infrastructure-as-Code adoption for Resideo. Designing resilient systems while ensuring the health of cloud platforms.
Senior Platform/DevOps Engineer at Yora shaping a cloud-native CI/CD ecosystem. Collaborate with product teams and ensure a stable, scalable, and evolving CI/CD foundation for faster and safer product delivery.
Site Reliability Engineer at Equifax ensuring reliability and performance of distributed fault-tolerant systems. Collaborating with teams to build cost-effective systems with high uptime metrics.