About the role

  • As a Site Reliability Engineer, you will improve the reliability of cloud communications technology and build monitoring solutions with a focus on operational readiness across Windows and Linux environments.

Responsibilities

  • Build and operate metrics/monitoring platforms: **Prometheus and/or VictoriaMetrics** (scrape configs, exporters, recording rules)
  • Design and maintain alerting strategy: thresholds, anomaly detection where applicable, alert routing, deduplication, and noise reduction
  • Integrate monitoring/alerting and events with **BigPanda** (correlation, enrichment, routing, incident workflows)
  • Create and maintain dashboards and operational visibility (Grafana or equivalent)
  • Develop and maintain runbooks, operational playbooks, and incident response procedures
  • Participate in **on-call shifts**: triage alerts, manage incidents, coordinate response, and lead communication during outages
  • Perform root-cause analysis, postmortems, and implement corrective/preventive actions
  • Improve service reliability via SLOs/SLIs, capacity planning, and automation to reduce toil
  • Support monitoring for core infrastructure and services on **Windows and Linux**, including HA components and clusters
  • Collaborate with DevOps/Engineering to instrument applications and standardize telemetry (metrics, logs, traces where applicable)

Requirements

  • Bachelor's degree in Computer Science or a related field
  • Experience in **SRE / Operations / DevOps** with production incident ownership
  • Hands-on experience with **Prometheus and/or VictoriaMetrics** (exporters, alert rules, recording rules, troubleshooting)
  • Experience integrating alerting/event pipelines with **BigPanda** (or similar event correlation tools)
  • Strong troubleshooting skills across **Linux and Windows** systems (networking, OS, services)
  • Ability to build reliable alerting with minimal noise (correlation, grouping, suppression, maintenance windows)
  • Experience with Git-based workflows for monitoring-as-code and configuration management

Nice to have

  • Grafana administration and dashboard design standards
  • Log management (ELK/EFK, Loki) and/or tracing (OpenTelemetry)
  • Automation skills (Python, PowerShell, Bash) and configuration tools (Ansible)
  • Messaging/cache/proxy operations: **RabbitMQ**, **Redis**, **Nginx**
  • Experience with Windows clustering or HA environments
  • Experience defining SLOs/SLIs and operational KPIs
  • Experience managing VoIP components and protocols (SIP, FreeSWITCH, OpenSIPS, session border controllers)
  • Experience with load balancing components (F5 LTM, F5 GTM)
  • Experience with virtualization platforms such as VMware or Hyper-V
  • Experience administering AWS or Azure tenants

Benefits

  • *We hire, promote, and compensate employees based on their ability to perform their job responsibilities, without regard to race, color, creed, religion, sex, gender, marital status, national origin, ancestry, age, citizenship, physical or mental disability, sexual orientation, or any other basis protected by applicable law (collectively referred to in our Code of Conduct as “Protected Classes”). We do not tolerate employment discrimination in the workplace, and we are committed to making reasonable accommodations for identified disabilities or other limitations as required by all applicable laws. We are an equal opportunity employer and value diversity at our company. We do not discriminate on the basis of race, religion, color, national origin, gender, sexual orientation, age, marital status, veteran status, or disability status.*

Job title

Site Reliability Engineer

Job type

Experience level

Mid level, Senior

Salary

Not specified

Degree requirement

Bachelor's Degree

Location requirements

Hybrid, Portugal
