Senior Site Reliability Engineer – Observability (Hybrid)

Posted 10 hours ago

About the role

  • As a Senior Site Reliability Engineer for the observability platforms at Dimensional, you will ensure platform reliability, scale the underlying infrastructure, and collaborate with other teams on operations and engineering projects.

Responsibilities

  • Serve as a primary escalation point for production support involving the ELK Stack, Grafana, and New Relic
  • Own platform health, capacity planning, and performance tuning for on-premises observability infrastructure – including Elasticsearch cluster management, index lifecycle policies, and retention strategies
  • Monitor and maintain SLOs for the observability platforms, ensuring the tools engineers depend on are highly available and performant
  • Support engineering teams in onboarding to observability platforms – helping teams instrument their applications, build dashboards, and define meaningful alerts
  • Manage patching, upgrades, and configuration management across the observability stack
  • Collaborate with security to harden platform configurations and manage software vulnerabilities
  • Contribute to on-call rotations and maintain runbooks and escalation procedures
  • Design and build tooling/automation to reduce toil and improve the experience for teams using observability platforms
  • Lead or contribute to platform modernization initiatives – e.g., improving ingestion pipelines, scaling platform capacity, standardizing Grafana dashboard and alerting patterns, or evaluating new capabilities within the existing stack
  • Develop and maintain infrastructure-as-code (Terraform, Helm, Ansible, etc.) for platform components
  • Build and enforce standards around logging, metrics, and alerting that help engineering teams adopt observability best practices at scale
  • Participate in design reviews and contribute to the overall platform roadmap

Requirements

  • Bachelor’s degree in a technical field or equivalent practical experience
  • 5+ years of experience in SRE, DevOps, or platform engineering roles
  • Deep hands-on experience with the ELK Stack – Elasticsearch cluster operations, Logstash pipeline development, Kibana, and index lifecycle management
  • Strong experience with Grafana, including data source integrations, dashboard design, and alerting
  • Solid understanding of observability principles
  • Experience operating on-premises infrastructure, including capacity planning, server management, and the operational tradeoffs compared with managed cloud services
  • Proficiency in Python for automation and tooling; familiarity with shell scripting
  • Strong Linux systems knowledge and comfort working with configuration management tools (e.g., Ansible, Chef, Puppet)
  • Demonstrated ability to drive incidents to resolution and communicate clearly under pressure
  • A bias toward automation and a low tolerance for repetitive manual work

Benefits

  • Comprehensive benefits
  • Educational initiatives
  • Special celebrations of our history, culture, and growth

Job title

Senior Site Reliability Engineer – Observability

Experience level

Senior

Salary

Not specified

Degree requirement

Bachelor's Degree
