Senior Site Reliability Engineer at Unify addressing reliability challenges and scaling data infrastructure. Collaborating on backend services and ensuring stable platform performance for enterprise customers.
Responsibilities
Scale our data infrastructure: Optimize and extend our ClickHouse and PostgreSQL deployments by designing partitioning strategies, tuning queries, and improving replication and failover systems.
Improve system performance: Profile and optimize critical paths across backend services, identify bottlenecks in data pipelines and API layers, and ship changes that improve latency and throughput.
Build for reliability: Implement rate limiting, circuit breakers, graceful degradation, and other patterns that keep the platform stable under load and during partial failures.
Automate everything: Write tooling that eliminates toil by automating deployments, scaling operations, backup verification, and incident remediation.
Instrument and observe: Build out distributed tracing, metrics, and alerting that give engineers clear visibility into system behavior and accelerate debugging.
Respond and learn: Participate in on-call rotations, run incident response, and drive blameless postmortems that prevent recurrence.
Requirements
5+ years of software engineering experience with a strong backend foundation, including 2+ years focused on reliability, infrastructure, or platform work.
Hands-on experience operating databases at scale, including query optimization, replication, and failover.
Strong programming skills (TypeScript, Python, Go, or similar) with experience building automation and tooling.
Able to diagnose complex distributed systems issues under pressure and communicate clearly during incidents.
A collaborative, low-ego attitude and a desire to work in a fast-paced environment.
Cloud Operations Engineer designing and implementing highly reliable cloud solutions. Leading cloud infrastructure initiatives for production operations and customer success in a growing team.
Quality Engineer supporting new product launches and reliability testing for SSDs at Micron in Malaysia. Responsible for coordinating test activities and conducting failure analysis.
Reliability Engineer ensuring operational readiness of data centers at Rowan Digital Infrastructure. Overseeing commissioning, operational standards, and transitioning facilities into live operations.
Manager of Mechanical Engineering ensuring high-availability mechanical systems in data centers. Collaborating on lifecycle management and performance evaluation across mission-critical facilities in a hybrid role.
DevOps Engineer developing reusable Ansible and Puppet modules and managing CI/CD for project teams. Join PLATH in Hamburg, focusing on crisis detection software development.
Senior DevOps Engineer designing and maintaining CI/CD pipelines for a leading connectivity firm. Collaborating with cross-functional teams to optimize cloud infrastructure and enhance operational excellence.
Mechanical Reliability Engineer at Cargill ensuring asset reliability through advanced maintenance practices. Collaborating with teams and overseeing projects in heavy industrial processes.
Sr. DevOps Engineer at AllTrails focused on enhancing infrastructure reliability and security. Collaborating with engineering teams and contributing to projects for system optimization.
Senior IT Analyst focusing on SRE for Itaú, the largest bank in Latin America. Ensuring reliability and performance of critical systems through automation and incident resolution.