SRE Lead architecting reliability practices for Upvest, a fintech driving easy investment through innovation. Shape reliability standards and mentor SREs in a fast-paced environment.
Responsibilities
Establish SLOs, SLIs, and error budgets that become the shared language of engineering velocity and stability.
Design chaos experiments, redundancy patterns, and failover strategies that validate resilience before customers ever see an issue.
Transform how we respond to and learn from incidents, turning chaos into confidence through world-class runbooks and post-mortem culture.
Drive load testing, benchmarking, and architecture reviews that ensure we can handle 10x our current scale.
Eliminate toil ruthlessly. Build tools that let engineers focus on value, not repetitive operational drudgery.
Define the principles and patterns (circuit breakers, bulkheading, graceful degradation, intelligent rate limiting) that make our services antifragile.
Validate system behavior under synthetic load and adversarial conditions, proving our defenses work before attackers or traffic spikes test them.
Requirements
Deep SRE Mastery: years in high-stakes environments (FinTech, payments, banking, trading, mission-critical SaaS) where downtime has real consequences.
Technical Depth: hands-on expertise with SLOs, chaos engineering, observability, automation, and the discipline of eliminating toil.
Systems Thinking: you understand resilience architecture deeply. You design systems that fail gracefully, so engineers don't have to scramble when something breaks.
Influence Without Authority: exceptional communication and stakeholder management skills. You can change minds and shape culture through clarity, not mandates.
Leadership DNA: proven ability to hire A-players, mentor engineers, and guide career growth while setting technical direction.
Bonus points
Investment Industry Fluency: understanding our domain means your SRE decisions align with business reality, not just technical ideals.
Stack Familiarity: experience with Golang, Kubernetes, GCP, Postgres, Kafka, or Datadog accelerates your impact.
Benefits
We're solving a hard problem: fixing the European financial infrastructure for securities so that more people can invest.
We invest in you: access to a personal coach, a development budget, and plenty of opportunities to grow in your role.
We live a culture of empowerment: We trust that we hire the best people and get out of their way. We value openness—there's a greater advantage in sharing information than keeping it to ourselves.
Flexible work environment: While we're not quite fully remote, we are committed to being a flexible employer, as we understand you don't have to be in the office to do your best work.