Site Reliability Engineer managing Kubernetes platforms at epay, focusing on reliability and scalability. Collaborating with product teams to ensure fast, resilient, and observable services.
Responsibilities
Operate and harden SUSE Harvester environments: lifecycle management, upgrades, node/cluster health, HA, capacity planning, and incident response.
Administer Longhorn storage for Kubernetes: performance tuning, disaster‑recovery design, backup/restore validation, and troubleshooting volume issues.
Manage Kubernetes clusters (multi‑cluster, multi‑tenant), including cluster creation, upgrades, admission control, API server health, and etcd maintenance.
Own CNI operations with Antrea: policy design, network performance, and east‑west traffic observability.
Run KubeVirt for VM workloads on Kubernetes: plan migrations, right‑size resources, and build reliable pipelines for VM lifecycle management.
Use Rancher to standardize cluster fleet management: provisioning (CAPI), templates, RBAC, and centralized policy/upgrade orchestration.
Implement GitOps with FluxCD: define release pipelines, drift detection, progressive delivery, and automated rollbacks.
Provision cloud/on‑prem resources with Crossplane: compose abstractions, manage providers, and enforce guardrails for day‑2 operations.
Build and maintain SLOs/SLIs: availability, latency, error budgets; automate alerts and runbooks tied to service health.
Reduce toil through automation: scripting, operators, controllers, and self‑service tooling for developers.
Participate in on‑call rotations, post‑incident reviews, and reliability roadmaps; drive corrective actions and platform improvements.
Requirements
3+ years in SRE/Platform/Systems Engineering (or equivalent) supporting production Kubernetes.
Hands‑on experience with SUSE Harvester and Longhorn or comparable HCI + distributed block storage.
Practical knowledge of Antrea CNI, KubeVirt, and Rancher fleet management.
Proficiency with FluxCD (GitOps patterns, Kustomize/Helm) and Crossplane (Compositions, Providers, RBAC).
Strong Linux administration (networking, filesystems, performance), observability (logs/metrics/traces), and scripting (Bash/Python).