Site Reliability Engineer managing incident response and system reliability for a healthcare AI platform, supporting day-to-day operations and improving platform stability.
Responsibilities
Participate in on-call and incident response: Respond to production incidents, contribute to service restoration, and support clear communication during incidents. Over time, take increasing responsibility for leading incidents end-to-end.
Improve operational reliability: Identify recurring issues and reliability risks, and drive fixes through better alerting, automation, system changes, or process improvements.
Own parts of the production environment: Operate and improve Kubernetes clusters, cloud infrastructure, and core platform services, with growing ownership as familiarity increases.
Strengthen observability: Improve dashboards, alerts, logs, and traces so issues are detected earlier and diagnosed faster, with a strong focus on actionable signals.
Reduce operational toil: Automate repetitive tasks, simplify runbooks, and improve tooling to make on-call and day-to-day operations easier and safer.
Support safe change: Improve deployments, rollback mechanisms, and operational readiness to reduce the risk of incidents caused by change.
Contribute to operational practices: Write and maintain runbooks, participate in blameless post-mortems, and help improve incident response processes over time.
Collaborate closely with engineers: Work with product and feature teams to improve production readiness, service ownership, and reliability expectations.
Requirements
3–6+ years in SRE, DevOps, Platform, or operations-heavy engineering roles.
Experience supporting production systems and participating in on-call rotations.
Comfortable debugging live systems under pressure.
Working knowledge of Kubernetes and containerised workloads.
Infrastructure as Code experience (Terraform or similar).
Familiarity with monitoring and alerting tools (Datadog, Prometheus, etc.).
Scripting or automation experience (Python, Bash, or similar).
Benefits
Real product momentum. We’re not trying to generate interest; we’re channeling it.
Equity from day one. When Heidi wins, you win. You’ll share directly in the success you help create.
Unmatched impact. Play a pivotal role in defining and scaling customer success at a critical growth moment, all while working on a product that delivers tangible value to clinicians and patients every day.
Work alongside world-class talent. Join a team of operators and builders who’ve scaled unicorns.
Global reach. Help shape our international expansion as we bring Heidi to key international markets.
Growth and balance. Enjoy a personal development budget, work from anywhere for a month, dedicated wellness days, and your birthday off to recharge.
Flexibility that works. A hybrid environment, with 3 days in the office.
Job title
Site Reliability Engineer – Mid-Senior, Operations-Focused