Senior Reliability Engineer applying software engineering principles to operations at The Hartford. Ensuring the reliability, performance, and scalability of data infrastructure and leading the transition to a modern RE model.
Responsibilities
Design, build, and maintain highly reliable, scalable, and resilient cloud-based data platforms on AWS and GCP, including core infrastructure and services like Snowflake, EKS, OpenSearch, EMR and Hadoop ecosystems.
Champion the RE mandate by identifying manual, repetitive operational tasks (toil) and developing robust automation solutions to eliminate them.
Implement and manage comprehensive observability solutions (monitoring, alerting, logging, tracing) for the underlying data infrastructure and applications, focusing on establishing clear Service Level Indicators (SLIs) and Service Level Objectives (SLOs).
Act as an escalation point for production incidents, leading incident response, performing deep root cause analysis (RCA), designing error budgets, and implementing preventative measures to ensure issues do not recur.
Lead the standardization of operational processes and documentation, including the creation and automation of dynamic runbooks and playbooks for consistent and efficient incident resolution and service management.
Serve as the RE Subject Matter Expert and collaborate with other Platform, Product, and Data Engineering Support teams to instill RE best practices, including participation in system design consulting, capacity planning, and deployment pipelines (CI/CD).
Requirements
10+ years of overall experience in an Infrastructure, Data, or related technology organization, with increasing responsibilities as a hands-on technologist.
Must have 5+ years of experience as an RE, Cloud, or DevOps Engineer, or in a similar role, supporting large-scale enterprise infrastructure and applications.
Strong scripting and programming skills (Python, etc.) for automation and tooling development.
Experience with infrastructure-as-code (e.g., Terraform, CloudFormation, Ansible) and CI/CD tools.
Experience designing and operating reliable and resilient infrastructure, fail-safe patterns, reliability controls, and observability from a Reliability Engineering (SRE/RE) infrastructure support perspective across cloud and big data platforms (AWS, GCP, Amazon EMR, Hadoop/Spark, OpenSearch, container orchestration platforms, etc.).
Familiarity with cloud-native integrations with databases, data integration, and business intelligence platforms (Snowflake, Informatica IDMC, Tableau, ThoughtSpot, etc.).
Expertise in setting up and tuning monitoring and alerting systems (e.g., Dynatrace, Splunk, Prometheus, Grafana, Datadog, OpenTelemetry).
Expertise in defining and implementing DataOps practices.
Expertise implementing AIOps to monitor, manage, and self-heal infrastructure and data platforms, as well as experience implementing machine learning principles for anomaly detection, alerting, and runbook automation.
Experience with prompt engineering, implementing AWS or Google AI services, and AI-enabled automation for infrastructure reliability and performance management.
Relevant industry certifications preferred (AWS, GCP, Kubernetes, SRE/DevOps frameworks, etc.).
Candidates must be authorized to work in the US without company sponsorship.
Benefits
Other rewards may include short-term or annual bonuses.
Senior Site Reliability Engineer focused on developing and maintaining OpenShift-based platform solutions at Red Hat. Responsible for software automation, onboarding new services, and maintaining service reliability.
Site Reliability Engineer at Red Hat designing Python and Golang solutions for managed services. Involves onboarding services, maintaining reliability, and fostering team excellence.
Development Operations Engineer supporting enterprise application development in Java and/or C. Ensuring high availability and operational excellence in modern payment solutions.
Site Reliability Engineer designing and supporting Kubernetes environments for F5's UDF platform. Collaborating with cross-functional teams to ensure reliability and operational excellence.
Senior Site Reliability Engineer ensuring operational excellence for multi-datacenter infrastructure at F5. Developing automation tools and APIs in Python and Go.
DevOps Engineer needed to develop a new OpenXDR solution on AWS, processing security data from multiple sources. Join a leading cybersecurity company in Slovakia.
DevOps Engineer at Castalia Systems automating and optimizing toolchain and CI/CD pipelines. Designing Azure infrastructure and ensuring collaboration between development and operations teams.
Senior DevOps Engineer managing Kubernetes and AI-driven workflows at Hex Trust. Supporting blockchain infrastructure while implementing best DevOps practices.