Site Reliability Engineer ensuring the reliability and performance of data platform services at Veepee, collaborating on cloud migration, Kubernetes operations, and observability best practices.
Responsibilities
Ensure the reliability and performance of our data platform services (Trino, Iceberg, S3, Kafka, Flink)
Define and implement SRE best practices: SLIs/SLOs, error budgets, and observability
Build and maintain monitoring, alerting, and incident response frameworks (Prometheus, Grafana, etc.)
Contribute to the migration from a public cloud data warehouse to VeepeeCloud’s lakehouse stack
Support the coexistence of cloud and on-prem systems while ensuring data consistency and service reliability
Help design resilient architectures for ingestion, transformation, and serving layers
Operate and improve services running on Kubernetes (GKE/EKS and on-prem clusters)
Automate infrastructure provisioning using Terraform, Atlantis, and/or Crossplane
Improve GitOps workflows for platform deployment and configuration
Collaborate with teams to optimize compute and storage usage (Trino queries, BigQuery slots, etc.)
Build tools and dashboards to track cost, usage, and efficiency
Support the transition toward cost-efficient on-prem workloads
Improve self-service capabilities for data teams (e.g., provisioning Trino/Iceberg resources)
Help teams adopt best practices in reliability, observability, and deployment
Write clear technical documentation and runbooks
Contribute to the definition and implementation of the Disaster Recovery Plan (DRP)
Ensure multi-DC resilience (FR1 / NL1) and implement data replication strategies
Participate in incident management and postmortems
Requirements
Strong experience with Kubernetes in production environments
Experience with distributed data systems (or a strong willingness to learn)
Solid understanding of SRE principles (monitoring, alerting, SLAs/SLOs)
Experience with Infrastructure as Code (Terraform or similar tools)
Familiarity with GitOps workflows
Experience with observability tools (Prometheus, Grafana, logging systems)
Comfortable working in cloud environments
Strong collaboration mindset and the ability to work across teams
Fluent in English
Benefits
Variable bonus
Dynamic and creative environment within international teams
Access to a variety of self-learning courses on our e-learning platform
Opportunity to participate in local and international meetups and conferences
Flexible office policy with up to 3 days of remote work per week