Hybrid Staff Site Reliability Engineer

Posted 10 hours ago

About the role

  • As a Staff Site Reliability Engineer at Zefr, you will apply cloud infrastructure expertise, collaborate on ML applications, and foster a DevOps culture while building scalable systems for responsible marketing in social environments.

Responsibilities

  • Support and build systems and tools that enable other engineers to generate, deploy, and manage product features and models both quickly and safely.
  • Deploy and support a multi-cloud microservice architecture, including infrastructure tailored for ML workloads, deployed via GitHub Actions, Argo CD, and Kubernetes.
  • Collaborate with other engineers, particularly the Machine Learning team, to architect secure, resilient, scalable, and cost-efficient applications and ML systems/pipelines in AWS and GCP.
  • Foster and push our DevOps culture and philosophy by encouraging continuous improvement across all engineering teams.
  • Proactively maintain the health of production environments, including monitoring application performance and resource utilization.
  • Participate in a 24/7 on-call rotation and respond to system performance issues and outages.
  • Debug code at the application and infrastructure level.
  • Mature our CI/CD workflows and release process.
  • Maintain a forward-thinking approach, actively researching and proposing new solutions.
  • Propose and review engineering Requests for Comments (RFCs) to drive engineering architecture and practices.

Requirements

  • 7+ years of experience designing, managing, deploying, and supporting cloud infrastructure in a production environment using major public cloud providers (GCP experience is a huge bonus).
  • Knowledge of GitOps, including an understanding of modern CI/CD pipelines, techniques, and technologies (GitHub Actions, GitLab, CircleCI, Argo CD, Flux).
  • Proficiency with IaC and configuration management tools (Terraform, Terragrunt, OpenTofu, Crossplane, Pulumi).
  • Production experience architecting, managing, deploying, and supporting container-based workloads on Kubernetes clusters.
  • Strong problem-solving skills, with a focus on automation.
  • Proven track record of building and scaling reliability practices, including SLO/SLI frameworks, incident management, and capacity planning.
  • Extensive production experience with observability platforms and practices (Prometheus, Grafana, Chronosphere, Datadog, OpenTelemetry), including the ability to design monitoring strategies for complex distributed systems.
  • Knowledge of cloud networking (mesh, NAT, load balancers, API gateways, proxies, etc.), cloud security, and cost optimization strategies.
  • Strong written and verbal communication, organization, and documentation skills.

Benefits

  • Flexible PTO
  • Medical, dental, and vision insurance with FSA options
  • Company-paid life insurance
  • Paid parental leave
  • 401(k) with company match
  • Professional development opportunities
  • 10+ paid holidays
  • Summer Fridays (we leave early)
  • In-office, hybrid, and fully-remote work options available
  • In-office lunches and lots of free food
  • Optional in-person and virtual events (we like to celebrate!)

Job title

Staff Site Reliability Engineer

Experience level

Lead

Salary

$190,000 - $210,000 per year

Degree requirement

Bachelor's Degree
