Technical Staff leading the architecture, reliability, and modernization of enterprise ALM and DevOps tools. Driving strategy and influencing product development in collaboration with various teams.
Responsibilities
Lead the architecture, reliability, and modernization of our enterprise ALM and DevOps tool ecosystem
Define and evolve HA, DR, and scaling architectures across all ALM tools
Build topology-aware designs and continuously measure and improve platform scalability, performance, and resilience
Ensure all tool services meet strict requirements for availability, reliability, and stability
Define and drive SLIs, SLOs, error budgets, and operational KPIs for every tool
Implement applied observability: actionable metrics, logs, traces, alerts, and dashboards tailored to each platform
Lead root-cause analysis, incident management, and continuous reduction in MTTD/MTTR
Architect Okta integrations across tools: SAML/OIDC/SCIM, entitlement frameworks, group/role mapping, and auditability
Ensure compliance with SRO/Security controls: hardening, secrets management, vulnerability remediation, and audit readiness
Drive the HVA program uplift for tools: threat modeling, compensating controls, and DR testing
Participate in PQC-readiness assessments and roadmap planning
Lead automation of infrastructure, upgrades, backup/restore, and operational workflows using Terraform, Ansible, GitOps, and tooling APIs
Create repeatable, consistent patterns for build pipelines (Jenkins/GitHub Actions) and artifact governance (Artifactory)
Work directly with SMEs and Engineering teams to standardize CI/CD, improve reliability, and reduce operational toil
Influence architecture and platform decisions across Engineering, Infrastructure, and Security teams
Mentor Architects, Principals, and Staff Engineers; create reference architectures and operational best practices
Partner with vendors and internal teams to evolve capabilities and ensure long-term platform health
Requirements
15+ years of experience with enterprise-scale DevOps/ALM platforms; deep expertise in several core tools (Jira, GHES, Jenkins, Artifactory, Kafka, SonarQube, Coverity, qTest)
Demonstrated ability to design and operate HA/DR architectures and deliver systems with 99.95%+ uptime
Strong background in SRE/SRO, applied observability, performance engineering, capacity planning, and scaling large tool deployments
Hands-on experience with Okta, identity federation, SCIM, authorization models, and enterprise entitlement design
Solid foundation in Linux, networking, containers, Kubernetes, databases, and distributed systems
Benefits
Your life. Your health. Supported by your benefits.