About the role

As a Site Reliability Engineer at epay, you will manage Kubernetes platforms with a focus on reliability and scalability, and collaborate with product teams to keep services fast, resilient, and observable.

Responsibilities

Operate and harden SUSE Harvester environments: lifecycle management, upgrades, node/cluster health, HA, capacity planning, and incident response.
Administer Longhorn storage for Kubernetes: performance tuning, disaster‑recovery design, backup/restore validation, and troubleshooting volume issues.
Manage Kubernetes clusters (multi‑cluster, multi‑tenant), including cluster creation, upgrades, admission control, API server health, and etcd maintenance.
Own CNI operations with Antrea: policy design, network performance, and east‑west traffic observability.
Run KubeVirt for VM workloads on Kubernetes: plan migrations, right‑size resources, and build reliable pipelines for VM lifecycle management.
Use Rancher to standardize cluster fleet management: provisioning (CAPI), templates, RBAC, and centralized policy/upgrade orchestration.
Implement GitOps with FluxCD: define release pipelines, drift detection, progressive delivery, and automated rollbacks.
Provision cloud/on‑prem resources with Crossplane: compose abstractions, manage providers, and enforce guardrails for day‑2 operations.
Build and maintain SLOs/SLIs: availability, latency, error budgets; automate alerts and runbooks tied to service health.
Reduce toil through automation: scripting, operators, controllers, and self‑service tooling for developers.
Participate in on‑call rotations, post‑incident reviews, and reliability roadmaps; drive corrective actions and platform improvements.

Requirements

3+ years in SRE/Platform/Systems Engineering (or equivalent) supporting production Kubernetes.
Hands‑on experience with SUSE Harvester and Longhorn or comparable HCI + distributed block storage.
Practical knowledge of Antrea CNI, KubeVirt, and Rancher fleet management.
Proficiency with FluxCD (GitOps patterns, Kustomize/Helm) and Crossplane (Compositions, Providers, RBAC).
Strong Linux administration (networking, filesystems, performance), observability (logs/metrics/traces), and scripting (Bash/Python).
Networking fundamentals (TCP/IP, L4/L7), Kubernetes networking/policies, TLS/cert management.
Experience designing for HA, capacity planning, backup/restore, and disaster recovery.

Nice to have

Experience with Cluster API (CAPI), RKE2/k3s, CSI drivers, and hardware lifecycle (firmware, BMC).
Familiarity with service meshes (e.g., Istio/Linkerd), policy engines (OPA/Gatekeeper), and secrets management.
Infrastructure automation (Terraform/Ansible) and CI/CD (GitHub Actions, GitLab CI, Azure DevOps).
Prior ownership of SLO programs and error‑budget policies.