Site Reliability Engineer focused on designing and maintaining the observability platform for dLocal, collaborating with global teams and optimizing system performance for major clients.
Responsibilities
Own OpenTelemetry Pipelines: Design, implement, and maintain observability pipelines across the three main signals—logs, metrics, and traces—ensuring standardized, scalable, and efficient data ingestion.
Empower Engineering Teams: Build self-service automation and tooling that enables development teams to instrument and leverage observability without requiring manual intervention from the SRE team.
Support Incident Management: Act as the engineering side of our Incident Management Team, designing the processes, playbooks, checklists, and automations that the team and other engineers follow during an incident.
Collaborate Across Teams: Work with members of almost every team across the business to understand their monitoring, alerting, and SLO/SLA requirements, and design systems and processes that ensure we meet or exceed them.
Automate Observability Infrastructure: Leverage Infrastructure-as-Code (IaC) to provision and manage monitoring tools, alerting rules, and our observability configurations across OTEL Pipelines.
Define Baseline Observability Standards: Design baseline requirements for new and existing services so that all dLocal infrastructure and code are monitored consistently and accurately.
Own Technical and Security Health: Take full ownership of dLocal’s infrastructure reliability, ensuring adherence to key availability and security KPIs.
Optimize Alerting Systems: Continuously refine alerting signals to minimize noise and ensure they are always actionable, reducing fatigue and improving response efficiency.
Requirements
Over 4 years of experience as a Site Reliability Engineer, or in a very similar role with a stronger focus on observability.
Expertise in Kubernetes, including its core components, deployment methodologies, and monitoring best practices.
Some understanding of OpenTelemetry, including setting up OTEL collectors, instrumentation, and pipeline optimization.
Proficiency with monitoring and logging tools such as Grafana, Prometheus, Loki, New Relic, or Datadog.
Hands-on experience with IaC tools (Terraform) and GitOps CI/CD solutions (ArgoCD, GitHub Actions, or similar).