Hybrid Site Reliability Engineer

Posted 3 days ago

About the role

  • Senior Site Reliability Engineer responsible for maintaining critical infrastructure for Cloaked's privacy platform, leading incident response, and ensuring reliability and scalability as the company grows.

Responsibilities

  • Define and maintain SLOs/SLAs that balance user experience with engineering velocity
  • Implement comprehensive monitoring and alerting in Datadog to detect production issues
  • Build resilient architectures that gracefully handle failures
  • Establish error budgets and use them to make data-driven decisions about feature velocity vs. stability
  • Lead incident response as primary on-call for infrastructure, taking critical load off leadership
  • Conduct thorough, blameless post-mortems to prevent recurrence
  • Build and maintain runbooks that enable faster resolution
  • Serve as the first line of defense when production issues occur
  • Identify and eliminate repetitive manual work through intelligent automation
  • Build self-healing systems that reduce operational burden
  • Improve deployment pipelines for faster, safer releases
  • Own reliability for a platform running on AWS with Kubernetes, ArgoCD, Cloudflare, and Terraform

Requirements

  • Solid experience with Infrastructure as Code (Terraform, CloudFormation, or similar) — building and managing cloud infrastructure programmatically
  • Deep Kubernetes experience beyond basic deployments — networking, resource management, storage, security contexts, and debugging complex cluster issues
  • Proficiency in Python, Go, or similar languages; you build tools and automate workflows that others want to use
  • Experience with CI/CD pipelines (GitLab CI, GitHub Actions, Jenkins) and deployment strategies (blue-green, canary, rolling)
  • Expertise with observability tools (Datadog, Prometheus, Grafana) — distributed tracing, metrics, and log aggregation
  • Strong Linux/Unix administration background with deep system-level understanding
  • Production experience with major cloud providers (AWS, GCP, Azure) — networking, compute, storage, and managed services
  • Operational experience with databases (SQL and NoSQL) and their performance characteristics
  • Deep knowledge of network protocols and troubleshooting (TCP/IP, DNS, HTTP/S, load balancing)
  • Ability to define meaningful SLIs/SLOs, calculate error budgets, and make data-driven reliability decisions
  • Experience running incidents under pressure, writing post-mortems that drive change, and implementing preventive measures
  • Ability to profile applications, identify bottlenecks, and optimize resource utilization
  • Experience with capacity planning — forecasting growth and scaling infrastructure proactively
  • Clear communication of complex technical issues to non-technical stakeholders
  • Documentation that others actually want to read

Benefits

  • Cloaked employees receive a 401(k) as well as top-of-the-line health, dental, and vision benefits.
  • We offer flexible work arrangements and the ability to work remotely as needed.
  • Cloaked provides a home office stipend in addition to a new company laptop (and other tech depending on the role).
  • 🌴 Competitive PTO: We encourage employees to take a minimum number of vacation days each quarter.
  • 🤸 Monthly health stipend: Use it for any kind of physical, mental, or emotional care you'd like.
  • 🥗 Late Night Meals: We offer employees a monthly meal stipend to use when they don't have time to make a home-cooked meal.
  • 🧠 Professional Growth: Opportunities for career development and personal growth are provided to all employees.

Job title

Site Reliability Engineer

Job type

Experience level

Mid level, Senior

Salary

$210,000 - $260,000 per year

Degree requirement

Bachelor's Degree

Location requirements
