Site Reliability Engineer responsible for the reliability and performance of Equisoft’s SaaS applications, collaborating with development and operations teams while managing incidents and optimizing infrastructure.
Responsibilities
Monitor daily SaaS operations to ensure consistent performance, reliability, and availability of services for customers.
Ensure adherence to SLAs (Service Level Agreements) by proactively monitoring and addressing potential issues to maintain high uptime and service quality.
Execute incident management procedures for outages or performance issues, including troubleshooting, root cause analysis, and post-mortem reviews.
Work on improving the operational efficiency of SaaS applications by fine-tuning infrastructure, monitoring systems, and optimizing performance.
Ensure all SaaS applications meet required security and compliance standards, conducting regular audits and addressing vulnerabilities proactively.
Identify areas for process improvement, driving automation initiatives to streamline workflows, reduce manual work, and enhance operational efficiency.
Act as a point of escalation for customer issues related to SaaS applications, working with support teams to resolve high-priority cases.
Monitor, analyze, and report on operational metrics (uptime, response times, incident counts), providing stakeholders with regular updates and up-to-date documentation.
Participate in disaster recovery exercises, ensuring regular backups and testing recovery processes for business continuity.
Ensure SaaS operations align with industry standards and best practices to provide a structured and effective service management approach.
Work closely with development and operations teams to ensure seamless integration and deployment.
Address and resolve production issues promptly to minimize downtime.
Participate in on-call rotations, troubleshooting issues and performing root cause analysis to ensure 24/7 system availability.
Requirements
Technical Bachelor’s Degree in Computer Engineering or Information Technology, or a College Diploma combined with 3 years of relevant experience
3+ years of experience in a similar role (Site Reliability Engineer, Production Support Engineer, DevOps, Programmer or related)
Proven track record of managing and optimizing production systems
Strong knowledge of system administration, networking, and Azure cloud services
Experience with CI/CD pipelines and infrastructure as code (e.g. Terraform)
Experience with monitoring and alerting tools (e.g. Azure Monitor, Application Insights)
Hands-on experience with Azure Kubernetes Service (AKS), Azure Container Instances, and container orchestration
Experience working closely with software development teams
Ability to read and understand code (e.g. .NET, C#, Java, or Python) to assist in debugging and identifying root causes of issues
Familiarity with application logs, stack traces, and performance profiling tools to pinpoint problems efficiently
Solid understanding of Azure SQL Database, Cosmos DB, and other Azure data services
Excellent knowledge of English (spoken and written)
Benefits
Medical
Dental
Term life/personal accident coverage
Wellness sessions
Telemedicine program
Flexible hours
Educational Support (LinkedIn Learning, LOMA Courses and Equisoft University)