Senior DevOps Engineer ensuring reliable and automated GCP-native infrastructure at Search Atlas. Collaborating across teams to enhance observability and streamline deployment processes.
Responsibilities
Administer and scale our GKE clusters to support a wide range of microservices, with a strong focus on resilience, cost optimization, and performance.
Contribute to the development of GitLab CI/CD pipelines and deployment automation with ArgoCD, supporting secure and traceable infrastructure workflows.
Manage infrastructure across GCP using Terraform, ensuring all systems are modular, scalable, and reproducible.
Implement and extend OpenTelemetry instrumentation. Build and maintain dashboards and alerts using Grafana and Sentry to detect and resolve issues quickly.
Support PostgreSQL, Elasticsearch, and ClickHouse operations, helping monitor performance, ensure uptime, and reduce cost across data layers.
Participate in on-call rotations, troubleshoot production issues, and contribute to disaster recovery and high-availability strategies.
Partner with backend, frontend, and QA teams to improve infrastructure reliability, streamline deployments, and ensure platform stability.
Requirements
5+ years of experience in DevOps or SRE roles working in production environments.
Strong proficiency with Kubernetes (GKE preferred) and GitOps workflows using ArgoCD.
Deep knowledge of GCP infrastructure and Terraform-based IaC practices.
Experience with OpenTelemetry for distributed tracing and instrumentation.
Expertise in Grafana, Datadog, and Sentry for observability and monitoring.
Operational knowledge of PostgreSQL, Elasticsearch, and ClickHouse.
Strong troubleshooting skills and experience with incident resolution in production systems.
Effective communication skills and ability to collaborate across teams.
Benefits
15 days of paid time off, plus Christmas Day and New Year's Day as paid holidays
Senior Site Reliability Engineer focusing on resilience and reliability for EarnIn's financial products. Collaborating with teams to enhance system availability and observability in production environments.
Site Reliability Engineer managing application operations and DevOps for aviation industry client. Collaborating on digital challenges and ensuring system reliability in hybrid environments.
Intern assisting in developing a release management tool for SES's Software Center of Expertise. Working with Golang, APIs, and CI/CD processes in Luxembourg.
Site Reliability Engineer responsible for the reliability of production systems at Modulate. Leading monitoring and incident response efforts as part of a growing engineering team.
Machine Learning Engineer responsible for designing and maintaining ML infrastructure on AWS at Roche. Key role in revolutionizing drug discovery using machine learning techniques with a close-knit team.
Senior Site Reliability Engineer operating scalable services in Azure and Kubernetes environments with a focus on reliability and performance improvements.
HPC Architect designing and optimizing high-performance computing solutions for semiconductor equipment. Collaborating with cross-functional teams to enhance compute workload capabilities.
Senior Site Reliability Engineer ensuring reliability, automation, and observability across cloud infrastructure. Focused on building self-service tools and improving performance in fast-paced environments.
Maintenance and Reliability Engineer optimizing preventive maintenance at VistaPrint's automated production facility in Venlo. Collaborating with cross-functional teams to drive continuous improvement in maintenance practices.
Senior Site Reliability Engineer at Five9 designing Kubernetes on bare metal and hypervisor platforms within private cloud environments. Responsible for architecture, design, and standardization in infrastructure and automation.