Hybrid Senior Site Reliability Engineer

Posted 2 months ago

About the role

  • The Senior Site Reliability Engineer is responsible for application reliability and security in DoD environments, and collaborates with the Infrastructure & Security team to improve service quality and operational efficiency.

Responsibilities

  • You'll own the reliability, scalability, and security of the production application and/or platform.
  • Building a World-Class Observability Platform: Design, implement, and manage our monitoring, logging, and alerting stack (e.g., Prometheus, Loki, Alloy, and Grafana).
  • Defining and Upholding Reliability: Define, measure, and own alerting that feeds into our Service Level Objectives (SLOs).
  • Leading Incident Response: Act as an incident responder and, when needed, as incident commander during critical incidents.
  • Automating for Scale and Security: Partner with platform engineers to design, build, and manage secure, resilient Kubernetes clusters.
  • Eliminating Toil and Scaling the Team: Proactively identify and eliminate operational toil by building automation.

Requirements

  • 3 years of experience in Site Reliability Engineering or a related field, with firsthand experience managing mission-critical systems in air-gapped DoD environments.
  • An active Top Secret security clearance. U.S. citizenship required.
  • Experience automating software delivery, deployment, and providing documentation and self-service tools for engineering teams and customers.
  • A strong understanding of Linux, containerization and orchestration, and virtual machines.
  • Experience with centralized logging, metrics, and observability using tools such as Prometheus, Loki, Grafana, ELK stack, or Datadog.
  • Networking fundamentals: core protocols and secure configurations.
  • A deep understanding of incident response processes, with experience conducting thorough root cause analyses and driving continuous improvement.
  • Clear, concise writing; strong documentation habits and async communication.
  • Core skills and technologies: VMware, Kubernetes, Docker, Helm, Ansible, Terraform, Linux, AWS, DoD compliance, and monitoring and observability tools.

Benefits

  • Relocation assistance provided
  • Active Top Secret Clearance required; SCI eligibility is a plus.

Job title

Senior Site Reliability Engineer

Job type

Experience level

Senior

Salary

$180,000 - $220,000 per year

Degree requirement

Bachelor's Degree

Location requirements
