Staff SRE Tech Lead overseeing platform reliability and scalability at Unify. Leading an SRE pod while enhancing infrastructure performance and implementing reliability best practices.
Responsibilities
Lead the SRE pod: Set technical direction, drive prioritization, and mentor engineers—ensuring the team is tackling the highest-leverage reliability and scalability challenges.
Scale our data infrastructure: Architect and extend our ClickHouse and PostgreSQL deployments to handle terabytes of new data monthly, including designing partitioning strategies, tuning queries, and building resilient replication and failover systems.
Improve system performance: Profile and optimize critical paths across our backend services, identify bottlenecks in data pipelines and API layers, and ship changes that meaningfully improve latency and throughput.
Build for reliability: Design and implement rate limiting, circuit breakers, graceful degradation, and other patterns that keep the platform stable under load and during partial failures.
Automate everything: Drive tooling that eliminates toil—automating deployments, scaling operations, backup verification, and incident remediation.
Instrument and observe: Build out distributed tracing, metrics, and alerting that give engineers clear visibility into system behavior and make debugging production issues fast.
Define and enforce SLOs: Establish reliability targets aligned with customer needs, manage error budgets, and drive architectural decisions that balance shipping speed with system stability.
Requirements
8+ years of software engineering experience with a strong backend foundation, including 3+ years focused on reliability, infrastructure, or platform work.
Experience leading teams or pods—setting technical direction, mentoring engineers, and driving execution on complex projects.
Deep expertise operating databases at scale, including schema design, query optimization, replication, and failover strategies.
Strong programming skills (TypeScript, Python, Go, or similar) with a track record of building automation and tooling that meaningfully reduces operational burden.
Collaborative, low-ego attitude with a history of leveling up the people around you.