Senior Site Reliability Engineer – Observability at Dimensional Fund Advisors

About the role

As a Senior Site Reliability Engineer for observability platforms at Dimensional, you will ensure reliability, scale the infrastructure, and collaborate with teams on operations and engineering projects.

Responsibilities

Serve as a primary escalation point for production support involving the ELK Stack, Grafana, and New Relic
Own platform health, capacity planning, and performance tuning for on-premises observability infrastructure – including Elasticsearch cluster management, index lifecycle policies, and retention strategies
Monitor and maintain SLOs for the observability platforms, ensuring the tools engineers depend on are highly available and performant
Support engineering teams in onboarding to observability platforms – helping teams instrument their applications, build dashboards, and define meaningful alerts
Manage patching, upgrades, and configuration management across the observability stack
Collaborate with security to harden platform configurations and manage software vulnerabilities
Contribute to on-call rotations and maintain runbooks and escalation procedures
Design and build tooling/automation to reduce toil and improve the experience for teams using observability platforms
Lead or contribute to platform modernization initiatives – e.g., improving ingestion pipelines, scaling platform capacity, standardizing Grafana dashboard and alerting patterns, or evaluating new capabilities within the existing stack
Develop and maintain infrastructure-as-code (Terraform, Helm, Ansible, etc.) for platform components
Build and enforce standards around logging, metrics, and alerting that help engineering teams adopt observability best practices at scale
Participate in design reviews and contribute to the overall platform roadmap

Requirements

Bachelor’s degree in a technical field or equivalent practical experience
5+ years of experience in SRE, DevOps, or platform engineering roles
Deep hands-on experience with the ELK Stack – Elasticsearch cluster operations, Logstash pipeline development, Kibana, and index lifecycle management
Strong experience with Grafana, including data source integrations, dashboard design, and alerting
Solid understanding of observability principles
Experience operating on-premises infrastructure, including capacity planning, server management, and the operational tradeoffs compared with managed cloud services
Proficiency in Python for automation and tooling; familiarity with shell scripting
Strong Linux systems knowledge and comfort working with configuration management tools (e.g., Ansible, Chef, Puppet)
Demonstrated ability to drive incidents to resolution and communicate clearly under pressure
A bias toward automation and a low tolerance for repetitive manual work

Benefits

Comprehensive benefits
Educational initiatives
Special celebrations of our history, culture, and growth
