Hybrid Site Reliability Engineer – Senior

Posted 13 hours ago

About the role

  • Senior SRE responsible for the reliability of large-scale distributed systems at Beyond Soluções. The role oversees data platform SLIs and SLOs and implements automation and advanced observability.

Responsibilities

  • Reliability Engineering: Define and monitor critical SLIs and SLOs for the data platform (job latency, workspace availability, Delta Lake integrity).
  • Advanced Observability: Implement end-to-end telemetry (logs, metrics and traces) to detect failures before they impact the business.
  • Automation and IaC: Eliminate manual work through automation, ensuring Databricks infrastructure is treated as code.
  • Incident Management and Post-mortems: Lead diagnosis of complex incidents in Spark/Azure environments and conduct blameless root-cause analyses to prevent recurrence.
  • Cost Efficiency (FinOps): Optimize consumption of compute resources (Databricks clusters) and Azure storage without compromising performance.
  • Self-Service Culture: Develop tools and abstractions that enable Data Engineers to operate autonomously and securely.
  • Capacity Planning: Manage platform capacity to support exponential growth in data volumes and AI/ML models.

Requirements

  • Experience in SRE or DevOps: Solid background ensuring availability of large-scale distributed systems.
  • Data Ecosystem Expertise: Mandatory experience (2+ years) with Azure and Databricks (especially workspace administration and cluster optimization).
  • Programming and Automation: Proficient in Python for building automation tools and scripts.
  • Big Data Troubleshooting: Deep knowledge of debugging Apache Spark jobs and analyzing bottlenecks in Delta Lake and cloud networking.
  • Observability: Experience with tools such as Azure Monitor, Grafana, Prometheus or Datadog for creating intelligent alerts.
  • Proven experience with Azure and Databricks is desirable.
  • Experience with CI/CD for Data Engineering (DataOps).
  • Familiarity with data governance and security (Unity Catalog).

Benefits

  • Flexible Meal and Food Allowance
  • Health Insurance
  • Dental Plan
  • Wellhub and TotalPass
  • Bio Ritmo gym exclusive for employees: at the Headquarters complex
  • Profit Sharing (PLR)
  • Equity Program: "Porto em Ação" — complementary to PLR until 2025
  • Sand and multipurpose courts: at the Headquarters complex
  • Transportation Voucher / Commuting Allowance
  • Van transportation services: available at the main access stations to Porto (Luz, Barra Funda, Santa Cecília and Júlio Prestes)
  • Extended Parental Leave: up to 40 days for all family configurations
  • Extended Maternity Leave: 6 months
  • On-site Medical Clinic with specialties: at Headquarters and Barra Funda
  • Childcare or nanny subsidy
  • Life Insurance
  • Private Pension Plan - PortoPrev
  • Discounts on Products and Services
  • Tuition Assistance: reimbursement for undergraduate, graduate or MBA programs
  • Monthly running events: subsidy for major road races in São Paulo
  • Language reimbursement (English or Spanish)
  • Porto Theater: exclusive sessions for employees
  • Library
  • Relaxation room: at the Headquarters complex
  • Game room: at the Headquarters complex
  • Massage and podiatry services: at the Headquarters complex

Job title

Site Reliability Engineer – Senior

Job type

Experience level

Senior

Salary

Not specified

Degree requirement

Bachelor's Degree

Location requirements
