Staff Site Reliability Engineer managing global infrastructure for NordVPN, automating systems and ensuring service reliability.
Responsibilities
Deliver projects on time: Plan, delegate, execute, and oversee key projects;
Collaborate: Work closely with stakeholders and other teams. Mentor colleagues and lead knowledge transfer;
Ensure quality and reduce technical debt: Deliver solutions with solid design and address blockers, toil, and debt to keep systems healthy;
Drive engineering excellence: Aim for quality and choose the right solution for the problems we face;
Protect solution quality: Ensure designs are implemented with proper quality and minimal tech debt;
Data‑backed decisions: Help teams and stakeholders navigate data and act on insights;
Design and maintain highly available, scalable infrastructure with monitoring, alerting, and anomaly detection;
Automate everything: Create and optimize automation to streamline deployments, improve speed, and cut manual work;
Solve complex issues: Troubleshoot, debug, and resolve critical issues in complex systems;
Use AI: Integrate AI into workflows and processes to speed up delivery and reduce toil.
Requirements
Observability: Experience with monitoring tools and frameworks to ensure system observability (OpenSearch, VictoriaMetrics, Prometheus, Thanos, Mimir, OpenTelemetry, Nagios);
Databases and storage systems: Experience operating highly available SQL and NoSQL databases and object stores at scale (MySQL, Percona, PostgreSQL, Cassandra, ClickHouse, Timescale, Druid, MinIO);
Data visualization: Ability to build meaningful dashboards that show the right insights (Grafana, OpenSearch Dashboards);
Alerting and anomaly detection: Ability to build anomaly detection and alerting pipelines;
Programming: Proficiency in one or more programming languages for automation scripts and integrations (Python, Go, Rust, C);
Linux: Strong knowledge of Linux systems, especially Debian‑based distributions;
Workflow: Ability to use workflow automation frameworks (Airflow, Prefect, n8n);
Configuration management: Ability to design and develop configuration management codebases and deployment pipelines (SaltStack, Ansible, Rundeck);
Networking: Strong understanding of networking protocols and concepts (Overlay, VPN, Proxy, DNS, HTTP, SSL, TCP, UDP);
Security: Ability to design secure systems, with working knowledge of security concepts and tools (Vault, PKI, mTLS).
DevOps Engineer responsible for managing Microsoft Intune operations at Bundesdruckerei GmbH. Focused on ensuring secure digital solutions for identity and data protection in Berlin.
Senior Site Reliability Engineer driving observability and reliability for business-critical systems at Incedo. Collaborating with engineering teams to enhance system resilience and performance.
DevSecOps Specialist securing the software development lifecycle at Vanguard. Collaborating with teams to improve application security tooling and processes, and provide development guidance.
Site Reliability Engineer automating infrastructure deployment for Scaleway's sovereign cloud products. Collaborating with product teams to enhance observability and reliability of the platform.
Reliability Engineer responsible for equipment reliability and safety using data-driven analysis for Wood in Aberdeen. Focus on proactive maintenance and operational efficiency.
Principal Safety and Reliability Engineer developing and supporting safety design for mission-critical aerospace systems. Engaging in design reviews and ensuring compliance with requirements.
Cloud DevOps Engineer playing a pivotal role in developing migration plans for Coast Guard Cloud Architecture. Collaborating with teams to ensure effectiveness and best practices in cloud implementation.
Reliability Engineer III at Daimler Truck developing propulsion technology solutions for electrified and conventional axle components. Leading testing and validation for complex powertrain systems.
Electrical Reliability Engineer at Marathon Petroleum maintaining electrical equipment and systems. Collaborating with cross-functional teams and ensuring compliance with electrical codes and standards.