SRE Technical Manager leading reliability engineering teams ensuring performance for Navy IT services. Manage teams, collaborate on automation, and drive continuous improvement in a critical systems environment.
Responsibilities
Manage and mentor 5-6 SRE teams (pods) and 60+ FTEs, providing guidance, setting performance expectations, and fostering professional development.
Work with SRE Resource Managers to staff and maintain the engineering resources your SRE vertical teams need to meet their reliability and scalability goals.
Responsible for the P&L across the Transport Services vertical. Manage the SRE team’s resources, including budget planning, tool selection, and infrastructure investments to meet reliability and scalability needs.
Meet regularly with your team members and participate in performance reviews, interviews, and development planning.
Oversee the reliability, availability, and performance of critical systems by leading the SRE teams within the data center vertical in implementing monitoring, incident response, and performance optimization strategies.
Ensure the team adheres to best practices for system reliability, automation, and operational efficiency.
Drive continuous improvement initiatives by analyzing performance metrics (e.g., SLOs, MTTR, MTBF) and identifying areas for enhancement.
Collaborate with operations, quality, cybersecurity, and other SRE engineering teams to define and enforce Service Level Objectives (SLOs) and manage error budgets.
Act as a liaison between the SRE team and other departments to prioritize reliability and operational needs in the product development process.
Collaborate with senior leadership to define the SRE strategy, set long-term reliability goals, and ensure alignment with business objectives.
Lead efforts to reduce operational toil through automation. Work with the team to build or enhance automation tools that manage infrastructure, monitor systems, and respond to incidents.
Oversee the development and adoption of Infrastructure as Code (IaC) tools, CI/CD pipelines, and other automation processes.
Ensure that SRE practices align with organizational security policies and compliance requirements.
Collaborate with security teams to integrate reliability-focused security practices into the design and operation of systems.
Ensure systems meet or exceed agreed-upon service levels by proactively addressing potential issues and working with stakeholders to align on reliability expectations.
Work within an SRE team, collaborating with Developers, Security, and Operations, to continuously deliver products and increase the value stream for the organization and customers.
Embrace and champion Agile development processes and the adoption of modern Site Reliability Engineering workflows and practices while providing technical guidance to team members and coworkers on best practices.
Stay up to date on the latest Site Reliability Engineering practices and technologies.
Strive to provide internal and external customers with world-class service.
Resolve most conflicts between timeline, budget, and scope independently, but use sound judgment to escalate complex or consequential issues to senior management.
Requirements
Requires a B.S. degree (or equivalent) in Cybersecurity, Information Security, IT, Network Engineering, Computer Science, or a related field with 8-10 years of SRE or DevOps experience, or a Master's degree with 6+ years of prior relevant experience, and at least 4 years in a leadership or management capacity.
US Citizen with DoD Secret Clearance.
Minimum of DoD 8570.01 IAT Level II Certification required prior to onboarding and must maintain certification while supporting the SMIT Contract.
Must be able to support program execution in classified environments and access SIPRNet from an NMCI location on short notice (local travel).
Exceptional written and oral communication skills, including producing technical analyses and reports, presentations, and executive-level briefings for internal and external stakeholders.
Ability to review and comprehend customer requirements and design solutions that satisfy them.
Ability to work in a highly collaborative, forward-thinking, and innovation-driven environment.
Proven experience managing teams responsible for large-scale, distributed systems with high reliability and performance demands.
Strong track record of managing incidents, conducting postmortems, and implementing reliability improvements.
Experience implementing and managing Agile or DevOps processes, with a focus on continuous improvement, efficiency, and team productivity.
Ability to lead teams through strategic initiatives such as reliability maturity assessments, process automation, and tooling selection.
Solid understanding of SRE principles, including Service Level Objectives (SLOs), Service Level Indicators (SLIs), and error budgeting.
Experience with commercial cloud infrastructure deployment environments such as AWS and Azure.
Strong knowledge of automation tools, CI/CD pipelines, and Infrastructure as Code (IaC).
Experience with Agile and DevSecOps/SRE concepts and best practices.
Hands-on experience with Atlassian products (Jira, Confluence, Bitbucket, etc.).