Site Reliability Engineer responsible for application reliability and security in DoD environments. Collaborating with the Infrastructure & Security team to enhance service quality and operational efficiency.
Responsibilities
You'll own the reliability, scalability, and security of the production application and/or platform.
Building a World-Class Observability Platform: Design, implement, and manage our monitoring, logging, and alerting stack (e.g., Prometheus, Loki, Alloy, and Grafana).
Defining and Upholding Reliability: Define, measure, and own alerting that feeds into our Service Level Objectives (SLOs).
Leading Incident Response: Act as an incident responder, and potentially as incident commander, during critical incidents.
Automating for Scale and Security: Partner with platform engineers to design, build, and manage secure, resilient Kubernetes clusters.
Eliminating Toil and Scaling the Team: Proactively identify and eliminate operational toil by building automation.
Requirements
3 years of experience in Site Reliability Engineering or a related field, with firsthand experience managing mission-critical systems within the DoD’s air-gapped environments.
An active Top Secret security clearance. U.S. citizenship required.
Experience automating software delivery and deployment, and providing documentation and self-service tools for engineering teams and customers.
A strong understanding of Linux, containerization and orchestration, and virtual machines.
Experience with centralized logging, metrics, and observability using tools such as Prometheus, Loki, Grafana, ELK stack, or Datadog.
Networking fundamentals: core protocols and secure configurations.
A deep understanding of incident response processes, with experience conducting thorough root cause analyses and driving continuous improvement.
Clear, concise writing; strong documentation habits and async communication.
Core skills and technologies: VMware, Kubernetes, Docker, Helm, Ansible, Terraform, Linux, AWS, DoD compliance, Monitoring and Observability tools.
Benefits
Relocation assistance provided
Active Top Secret Clearance required; SCI eligibility is a plus.