Site Reliability Engineer for Leidos ensuring reliability, performance, and scalability of complex distributed systems for the Navy-Marine Corps Intranet. Collaborating with teams to maintain and optimize network operations and services.
Responsibilities
Work alongside the development and operations teams to ensure fast, reliable software deployments, monitor systems, improve overall platform reliability, create change management tickets, and perform break-fix on network appliances.
Develop features using the AI coding tool and the repository of scripts to automate, scale, test, and secure the cloud infrastructure and pipelines.
Enhance performance monitoring of the various systems via Splunk or other dashboard reporting tools.
Identify performance bottlenecks and optimize the performance of cloud infrastructure.
Contribute to our SRE journey by suggesting ways to improve engineering builds, maintenance, automation, and reliability across the platform with SRE/DevOps tools and Infrastructure-as-Code.
Develop and code high-quality pipeline automation workflows, both inside and outside the cloud platform, that align with business and technology strategies.
Develop and execute test strategies that simulate real-world failure scenarios, including network disruptions, hardware failures, and system overloads.
Create, script, and run performance tests to measure system behavior under varying levels of load and traffic.
Identify bottlenecks, performance degradation, and areas for optimization.
Design, implement, and maintain automated test suites for infrastructure and application components.
Ensure that testing is integrated into the CI/CD pipeline to validate system reliability with every release.
Build automated systems for continuous performance testing, stress testing, and load testing.
Work closely with SREs, developers, and operations teams to define reliability goals and develop appropriate testing strategies to validate those goals.
Ensure that new services and features undergo thorough testing for performance, reliability, and failure recovery before deployment to production.
Validate that monitoring, logging, and alerting mechanisms are functioning correctly by testing systems under failure conditions.
Ensure that Service Level Indicators (SLIs) and Service Level Objectives (SLOs) are accurately measured and tracked through automated testing frameworks.
Resolve most conflicts among timeline, budget, and scope independently, but use judgment to raise complex or consequential issues to senior management.
Requirements
Requires a BS degree and 8-12 years of relevant experience.
Must currently possess, and be able to maintain, an active DoD Secret security clearance and be eligible for a Top Secret clearance.
Minimum of DoD 8570.01 IAT Level II Certification required prior to onboarding and must maintain certification while supporting the SMIT Contract.
Must be able to support program execution in classified environments and access SIPRNet from an NMCI location on short notice (local travel).
Must have a vendor certification (e.g., CCNA, CCNP, Juniper, Palo Alto, Aruba).
Experience designing, coding, debugging, and maintaining automation scripts (using Bash, Python, etc.) preferred.
Experience with CI/CD toolsets (e.g., Jenkins, GitLab).
Experience with network switches, routers, VLANs, DMZ, VPN, IPS, load balancers, and firewalls.
Experience with SD-WAN and Arista.
Experience in application administration, configuration, and integration.
Familiarity with agile development methodologies.
Skilled and disciplined enough to work effectively with a distributed team.
Ability to work in a highly collaborative, forward-thinking, and innovation-driven environment.
Knowledge of Agile and DevSecOps/SRE concepts and best practices, with a desire to grow that knowledge.
Hands-on experience with Atlassian products (Jira, Confluence, Bitbucket, etc.).
Senior DevOps Engineer responsible for cloud ecosystem architecture at a health-tech startup. Building HIPAA/GDPR-compliant foundations and mentoring developers.
Senior Backend Engineer building product features and maintaining infrastructure for an insurance platform. Employing tools like Terraform, Kafka, Datadog, and Qovery with a strong DevOps focus.
DevOps Systems Engineer supporting customer operations in Annapolis Junction, MD. Responsible for creating, sustaining, and troubleshooting complex operational data flows.
OpenShift Fresher assisting Cloud team in managing containerized applications using Red Hat OpenShift. Supporting CI/CD, deployment automation, and cloud-native application environments.
DevOps Engineer evolving banking infrastructure for a fintech company. Focusing on observability, incident response, and platform automation in a hybrid work setup.
Lead Site Reliability Engineer managing critical IT systems for S&P Dow Jones Indices. Focused on service availability, incident management, and developer collaboration to enhance operational reliability.
Lead DevOps Engineer developing AI-powered supply chain intelligence solutions at S&P Global Mobility. Collaborating with data scientists and engineers to optimize operational infrastructure and continuous delivery processes.
Senior DevOps Engineer managing development and deployment pipelines for AI products at Plaud. Optimizing infrastructure, enhancing productivity, and collaborating with cross-functional teams.
Senior SRE Engineer ensuring reliability and performance of AI products at Plaud. Designing scalable systems and leading incident response to improve operational maturity.
DevOps Engineer supporting big data solutions and AWS infrastructure deployment at Enlighten. Collaborating with teams to ensure reliability, scalability, and performance of cloud services.