Senior Site Reliability Engineer managing cloud infrastructure for SaaS solutions at PROS Holdings. Focusing on reliability, automation, and team collaboration in a hybrid work environment.
Responsibilities
Design, implement, and maintain secure, scalable infrastructure across cloud environments
Analyze cloud environment requirements from various sources, document system designs, and implement necessary modifications
Automate repetitive system tasks and manage system-related activities for internal and external clients, including Professional Services support
Ensure system reliability through robust failover mechanisms, disaster recovery processes, and 24/7 support strategies
Design, implement, and improve monitoring tools to meet SLOs, ensuring a “Monitor by Design” approach is adopted across product teams
Continuously drive reliability improvements through proactive initiatives, data-driven SLO adjustments, and advanced monitoring/alerting solutions
Lead and coordinate disaster recovery testing exercises and capacity planning to enhance system reliability
Identify and reduce operational toil through automation and tool development
Apply and enforce security best practices across cloud environments, while mentoring team members on SLO achievement
Facilitate cross-team communication, provide training, and maintain clear documentation (e.g., runbooks and procedures)
Support cloud environment management and propose technology changes to improve performance and reliability
Requirements
7+ years of experience as a System Administrator, DevOps Engineer, SRE, or in a similar role
Deep knowledge of Linux administration, including performance monitoring, tuning, and troubleshooting
Experience with cloud network design (Azure preferred, AWS or GCP also considered)
Proficiency in scripting (e.g., Bash, Python) for automation
Experience with version control software (preferably Git)
Experience with configuration management tools (e.g., Puppet, Foreman, Ansible)
Knowledge of container orchestration tools (e.g., Kubernetes, Docker Swarm)
In-depth knowledge of monitoring and logging solutions for cloud infrastructure (e.g., Prometheus, Grafana)
Bachelor’s degree in Computer Science or a related field
Excellent time management, organizational, crisis management, and problem-solving skills
Self-starter, able to work independently without direct supervision
Willingness to innovate, learn, and share knowledge
Excellent verbal and written communication skills
Experience developing and implementing IT security best practices and procedures
Willingness to participate in on-call rotations and respond to incidents in a timely and effective manner
Senior Reliability Engineer applying a variety of reliability techniques and managing projects at Baker Hughes. Collaborating with teams to meet customer expectations and enhance their success.
Staff Site Reliability Engineer managing large-scale systems and ensuring infrastructure reliability for NordVPN's services. Collaborating on automating platforms and solving complex technical challenges.
Site Reliability Engineer responsible for infrastructure performance and reliability at ASAPP, collaborating with product engineering teams and automating processes.
DevOps Technical Lead specializing in automation and CI/CD pipeline management at Stanley Black & Decker. Leading a team to enhance cloud infrastructure within an innovative technology environment.
DevOps Engineer for Vodafone Innovus enhancing DevOps solutions in IoT applications. Collaborating with software, QA, and systems engineers to optimize deployment and continuous integration.
DevOps Engineer accountable for the Salesforce DevOps program at S&P Global. Collaborating with Agile teams, managing releases, and enhancing DevOps processes.
DevSecOps Engineer designing secure cloud infrastructure at CredLens, ensuring best practices in security throughout the development lifecycle. Collaborating with engineering and data teams on dependability and compliance.
Senior Site Reliability Engineer ensuring reliability, scalability, and performance of services at Granicus. Leading automation processes and implementing best practices in site reliability engineering.
Senior Site Reliability Engineer at Coinbase, focusing on identity and access management tooling. Responsibilities include automation, cloud-native development, and maintaining secure system architectures.
Join CORTO as a DevOps Engineer working on AWS infrastructure to enhance legal tech solutions. Collaborate with a high-achieving team to optimize and support development environments.