SRE Senior Engineer ensuring the reliability of large-scale distributed systems at Beyond Soluções. Overseeing data platform SLIs and SLOs while implementing automation and advanced observability.
Responsibilities
Reliability Engineering: Define and monitor critical SLIs and SLOs for the data platform (job latency, workspace availability, Delta Lake integrity).
Advanced Observability: Implement end-to-end telemetry (logs, metrics and traces) to detect failures before they impact the business.
Automation and IaC: Eliminate manual work through automation, ensuring Databricks infrastructure is treated as code.
Incident Management and Post-mortems: Lead diagnosis of complex incidents in Spark/Azure environments and conduct blameless root-cause analyses to prevent recurrence.
Cost Efficiency (FinOps): Optimize consumption of compute resources (Databricks clusters) and Azure storage without compromising performance.
Self-Service Culture: Develop tools and abstractions that enable Data Engineers to operate autonomously and securely.
Capacity Planning: Manage platform capacity to support exponential growth in data volumes and AI/ML models.
Requirements
Experience in SRE or DevOps: Solid background ensuring availability of large-scale distributed systems.
Data Ecosystem Expertise: Mandatory experience (2+ years) with Azure and Databricks (especially workspace administration and cluster optimization).
Programming and Automation: Proficient in Python for building automation tools and scripts.
Big Data Troubleshooting: Deep knowledge of debugging Apache Spark jobs and analyzing bottlenecks in Delta Lake and cloud networking.
Observability: Experience with tools such as Azure Monitor, Grafana, Prometheus or Datadog for creating intelligent alerts.
Proven experience with Azure and Databricks is desirable.
Experience with CI/CD for Data Engineering (DataOps).
Familiarity with data governance and security (Unity Catalog).
Benefits
Flexible Meal and Food Allowance
Health Insurance
Dental Plan
Wellhub and TotalPass
Bio Ritmo gym, exclusive to employees: at the Headquarters complex
Profit Sharing (PLR)
Equity Program: "Porto em Ação" — complementary to PLR until 2025
Sand and multipurpose courts: at the Headquarters complex
Transportation Voucher / Commuting Allowance
Van transportation services: available at the main access stations to Porto (Luz, Barra Funda, Santa Cecília and Júlio Prestes)
Extended Parental Leave: up to 40 days for all family configurations
Extended Maternity Leave: 6 months
On-site Medical Clinic with specialties: at Headquarters and Barra Funda
Childcare or nanny subsidy
Life Insurance
Private Pension Plan - PortoPrev
Discounts on Products and Services
Tuition Assistance: reimbursement for undergraduate, graduate or MBA programs
Monthly running events: subsidy for major road races in São Paulo
Language reimbursement (English or Spanish)
Porto Theater: exclusive sessions for employees
Library
Relaxation room: at the Headquarters complex
Game room: at the Headquarters complex
Massage and podiatry services: at the Headquarters complex