Site Reliability Engineer managing Kubernetes platforms at epay, focusing on reliability and scalability. Collaborating with product teams to ensure fast, resilient, and observable services.
Responsibilities
Operate and harden SUSE Harvester environments: lifecycle management, upgrades, node/cluster health, HA, capacity planning, and incident response.
Administer Longhorn storage for Kubernetes: performance tuning, disaster‑recovery design, backup/restore validation, and troubleshooting volume issues.
Manage Kubernetes clusters (multi‑cluster, multi‑tenant), including cluster creation, upgrades, admission control, API server health, and etcd maintenance.
Own CNI operations with Antrea: policy design, network performance, and east‑west traffic observability.
Run KubeVirt for VM workloads on Kubernetes: plan migrations, right‑size resources, and build reliable pipelines for the VM lifecycle.
Use Rancher to standardize cluster fleet management: provisioning (CAPI), templates, RBAC, and centralized policy/upgrade orchestration.
Implement GitOps with FluxCD: define release pipelines, drift detection, progressive delivery, and automated rollbacks.
Provision cloud/on‑prem resources with Crossplane: compose abstractions, manage providers, and enforce guardrails for day‑2 operations.
Build and maintain SLOs/SLIs: availability, latency, error budgets; automate alerts and runbooks tied to service health.
Reduce toil through automation: scripting, operators, controllers, and self‑service tooling for developers.
Participate in on‑call rotations, post‑incident reviews, and reliability roadmaps; drive corrective actions and platform improvements.
Requirements
3+ years in SRE/Platform/Systems Engineering (or equivalent) supporting production Kubernetes.
Hands‑on experience with SUSE Harvester and Longhorn or comparable HCI + distributed block storage.
Practical knowledge of Antrea CNI, KubeVirt, and Rancher fleet management.
Proficiency with FluxCD (GitOps patterns, Kustomize/Helm) and Crossplane (Compositions, Providers, RBAC).
Strong Linux administration (networking, filesystems, performance), observability (logs/metrics/traces), and scripting (Bash/Python).
DevOps Manager responsible for managing a team for multi-cloud solutions supporting the USAF Cloud One project. Focus on scalable cloud-native solutions and CI/CD practices.
Lead Site Reliability Engineer overseeing SRE practices across Azure and GCP platforms. Driving reliability improvements and leading a team at Lloyds Banking Group.
DevOps Engineer responsible for managing Microsoft Intune operations at Bundesdruckerei GmbH. Focused on ensuring secure digital solutions for identity and data protection in Berlin.
Senior Site Reliability Engineer driving observability and reliability for business-critical systems at Incedo. Collaborating with engineering teams to enhance system resilience and performance.
DevSecOps Specialist securing the software development lifecycle at Vanguard. Collaborating with teams to improve application security tooling and processes, and provide development guidance.
Site Reliability Engineer automating infrastructure deployment for Scaleway's sovereign cloud products. Collaborating with product teams to enhance observability and reliability of the platform.
Reliability Engineer responsible for equipment reliability and safety using data-driven analysis for Wood in Aberdeen. Focus on proactive maintenance and operational efficiency.
Principal Safety and Reliability Engineer developing and supporting safety design for mission-critical aerospace systems. Engaging in design reviews and ensuring compliance with requirements.
Cloud DevOps Engineer playing a pivotal role in developing migration plans for Coast Guard Cloud Architecture. Collaborating with teams to ensure effectiveness and best practices in cloud implementation.