Site Reliability Engineer managing Kubernetes platforms at epay, focusing on reliability and scalability. Collaborating with product teams to ensure fast, resilient, and observable services.
Responsibilities
Operate and harden SUSE Harvester environments: lifecycle management, upgrades, node/cluster health, HA, capacity planning, and incident response.
Administer Longhorn storage for Kubernetes: performance tuning, disaster‑recovery design, backup/restore validation, and troubleshooting volume issues.
Manage Kubernetes clusters (multi‑cluster, multi‑tenant), including cluster creation, upgrades, admission control, API server health, and etcd maintenance.
Own CNI operations with Antrea: policy design, network performance, and east‑west traffic observability.
Run KubeVirt for VM workloads on Kubernetes: plan migrations, right‑size resources, and build reliable pipelines for VM lifecycle.
Use Rancher to standardize cluster fleet management: provisioning (CAPI), templates, RBAC, and centralized policy/upgrade orchestration.
Implement GitOps with FluxCD: define release pipelines, drift detection, progressive delivery, and automated rollbacks.
Provision cloud/on‑prem resources with Crossplane: compose abstractions, manage providers, and enforce guardrails for day‑2 operations.
Build and maintain SLOs/SLIs: availability, latency, error budgets; automate alerts and runbooks tied to service health.
Reduce toil through automation: scripting, operators, controllers, and self‑service tooling for developers.
Participate in on‑call rotations, post‑incident reviews, and reliability roadmaps; drive corrective actions and platform improvements.
Requirements
3+ years in SRE/Platform/Systems Engineering (or equivalent) supporting production Kubernetes.
Hands‑on experience with SUSE Harvester and Longhorn or comparable HCI + distributed block storage.
Practical knowledge of Antrea CNI, KubeVirt, and Rancher fleet management.
Proficiency with FluxCD (GitOps patterns, Kustomize/Helm) and Crossplane (Compositions, Providers, RBAC).
Strong Linux administration (networking, filesystems, performance), observability (logs/metrics/traces), and scripting (Bash/Python).
Sr. Site Reliability Engineer designing and automating robust technical infrastructure at Broadridge. Collaborating across teams for successful deployment and operational support of services.
Senior Fleet Reliability Engineer maintaining high fleet uptime for autonomous vehicle technology. Collaborating with technical teams to ensure peak operational performance in data collection efforts.
DevOps Lead at Leidos managing platform engineering, SRE, and application security functions. Driving operational excellence and ensuring scalability for federal government applications.
SRE Lead developing scalable cloud-native solutions for mission-critical systems supporting USAF. Managing teams, collaborating with cross-functional units, and ensuring high service reliability standards.
Junior DevOps / Platform Engineer at DieEnergiekoppler GmbH managing AWS/EKS platform operations. Collaborating with team members to improve platform functionalities and security compliance.
DevOps Engineer responsible for AWS infrastructures and backend development at Allguth GmbH. Engaging in greenfield projects with modern solutions in a collaborative team.
Cloud DevOps Specialist responsible for building scalable infrastructure solutions in AWS at SONDA. Focusing on automation, containerization, and data management in a collaborative environment.
DevOps Engineer maintaining and evolving deployment pipelines for Docebo’s AI-powered learning platform. Collaborating with cross-functional teams to ensure efficient software releases and infrastructure management.
DevOps Engineer optimizing CI/CD pipelines for Docebo, an AI-powered learning platform. Managing multi-tenant infrastructure using AWS, Docker, and Kubernetes.
DevOps Engineer maintaining and automating infrastructure and CI/CD processes for cybersecurity solutions by NordLayer. Collaborating with teams to ensure performance and scalability of cloud services.