Site Reliability Engineer at HPE designing and optimizing cloud infrastructure and deployment systems. Impacting scalability, security, and operational efficiency across various platforms.
Responsibilities
As a Staff Software Engineer, you will play a key role in designing, building, and optimizing cloud infrastructure and deployment systems.
Your work will directly impact scalability, security, and operational efficiency across our platforms.
Enhance Infrastructure as Code (IaC) and enforce best practices.
Optimize cloud infrastructure for scalability, security, and cost-effectiveness.
Develop internal tools to support and streamline cloud platform operations.
Improve CI/CD pipelines and deployment workflows using FluxCD and Jenkins.
Address container image vulnerabilities and standardize remediation processes.
Build Amazon Machine Images (AMIs) aligned with CIS and STIG benchmarks.
Strengthen monitoring, alerting, and observability using Prometheus, Grafana, and logging tools.
Troubleshoot complex production issues to ensure system reliability and customer satisfaction.
Fine-tune distributed systems such as Apache Kafka and Cassandra.
Collaborate with development, security, and operations teams to align infrastructure with application needs.
Requirements
Minimum of 6 years of hands-on experience in InfraOps, DevOps, or Site Reliability Engineering (SRE).
Proficiency with Linux systems, especially Debian-based distributions.
Strong experience with cloud platforms such as AWS and GCP.
Expertise in Infrastructure as Code tools like Terraform, Packer, and Ansible.
Solid programming skills in Python and/or Golang.
Deep understanding of containerization (Docker, containerd) and orchestration tools (AWS EKS, GCP GKE).
Experience with GitOps workflows.
Proven track record in implementing and maintaining CI/CD pipelines.
Strong background in security and familiarity with security programs.
Experience with monitoring and logging tools (Prometheus, Grafana, ELK).
Knowledge of both relational (SQL) and non-relational databases.
Excellent problem-solving and debugging skills with a strong sense of ownership.
Experience managing distributed systems like Apache Kafka and Cassandra.
Effective communicator and collaborative team player.
Benefits
Health & Wellbeing: We strive to provide our team members and their loved ones with a comprehensive suite of benefits that supports their physical, financial and emotional wellbeing.
Personal & Professional Development: We also invest in your career because the better you are, the better we all are. We have specific programs catered to helping you reach any career goals you have — whether you want to become a knowledge expert in your field or apply your skills to another division.
Unconditional Inclusion: We are unconditionally inclusive in the way we work and celebrate individual uniqueness. We know varied backgrounds are valued and succeed here. We have the flexibility to manage our work and personal needs. We make bold moves, together, and are a force for good.
DevOps Engineer assisting developers in leveraging DevOps tooling and best practices for Cat Digital applications. Collaborating closely with development teams to optimize delivery and troubleshooting.
Reliability Engineer providing strategic support at Y12 National Security Complex. Enhancing equipment reliability and maintainability through proactive maintenance strategies.
Upper Steering System Design and Release Engineer responsible for managing steering components and suppliers. Engaging in design and development of upper steering systems for Ford vehicles in a hybrid capacity.
Senior DevOps Engineer implementing CI/CD solutions for software projects. Requires expertise in Docker, Azure, and IaC tools in a hybrid work environment.
DevOps Engineer ensuring the stability and scalability of the justtrack platform. Collaborate with development teams managing the cloud infrastructure for a SaaS solution.
Site Reliability Intern ensuring smooth operation of Compute services and collaborating on tooling development. Participate in teams for system performance and reliability improvements in a global tech company.
Site Reliability Engineer at ING enhancing BTP platform services with a focus on reliability and scalability. Collaborating with cross-functional teams to drive continuous improvement and implement effective monitoring solutions.
DevOps role at Vodafone responsible for designing and maintaining decisioning workflows for automated credit vetting using DataView360 platform. Collaborate with analysts to translate requirements into technical solutions.
SRE Lead responsible for driving reliability and performance across Platform Engineering ecosystem at Birlasoft. Leading capacity planning, incident management, and mentoring SRE engineers.
Senior Director of Engineering leading the DevSecOps Platform team. Championing developer experiences and integrated practices to enhance security and effectiveness at FIS.