Site Reliability Engineer at HPE ensuring high availability and performance of cloud infrastructure across AWS and GCP environments. Managing incidents, monitoring systems, and supporting multi-cloud production.
Responsibilities
Ensure high availability, reliability, and performance of large-scale cloud infrastructure across AWS and GCP environments.
Operate and support infrastructure components and distributed data platforms such as Kubernetes, Kafka, Flink, Storm, and Spark.
Manage and maintain databases including Cassandra, Elasticsearch, Redis, Postgres, and ArangoDB.
Monitor systems, troubleshoot issues, and resolve production incidents across microservices and distributed systems.
Collaborate closely with software engineering teams to debug and resolve complex production problems.
Participate in 24x7 on-call rotation supporting multi-cloud production environments.
Monitor system metrics, application performance, and infrastructure health using observability tools.
Own the incident management lifecycle, including detection, mitigation, Root Cause Analysis (RCA), and post-incident reviews.
Develop and maintain runbooks, automation, and operational processes to improve reliability and efficiency.
Perform capacity planning using system usage and performance data.
Drive SRE best practices, operational standards, and continuous improvement initiatives.
Requirements
Bachelor’s or Master’s degree in Computer Science, Information Systems, or a related field.
6–10+ years of experience in DevOps, Site Reliability Engineering, or cloud infrastructure roles.
Strong hands-on experience with cloud platforms (AWS or GCP) including services like EC2/GCE, IAM, and object storage (S3/GCS).
Experience with containerization and orchestration technologies, especially Docker and Kubernetes.
Experience building and managing CI/CD pipelines using tools such as Jenkins, GitHub Actions, or GitLab.
Experience with monitoring and observability tools such as Prometheus, CloudWatch, or Stackdriver.
Strong understanding of Linux systems administration and configuration management tools like Ansible.
Experience managing distributed systems and streaming platforms such as Kafka, Cassandra, Elasticsearch, Spark, Flink, or Storm.
Strong automation and scripting skills in Python, Go, Rust, or shell.
Experience with Infrastructure as Code (IaC) tools like Terraform or CloudFormation.
Excellent analytical, troubleshooting, and problem-solving skills.
Strong communication and collaboration skills with the ability to work with cross-functional teams.
Site Reliability Engineer responsible for system reliability and performance at a leading financial services technology company. Collaborating with infrastructure, engineering, and security teams to build robust systems.
Principal Release Engineer leading and orchestrating end-to-end release management at F5. Driving cross-platform coordination and ensuring seamless releases across enterprise transformation programs.
Site Reliability Engineer focused on developing and improving Kubernetes configurations for F5's infrastructure. Collaborating with product teams and ensuring operational delivery processes are efficient and reliable.
Sr DevOps Manager leading the way in cloud infrastructure, DevOps, and SRE practices at F5. Empowering engineers and fostering a culture of collaboration and improvement.
Senior Site Reliability Engineer developing IT infrastructure and automation solutions for Coinbase. Collaborating with infrastructure, security, and compliance teams to enhance operational efficiency.
DevOps Engineer joining the AI and Innovation team to ensure scalable, secure, and resilient systems at a global media agency. Collaborating with UX and AI engineers for next-generation media experiences.
Senior SRE/DevOps managing cloud architecture, driving automation, and ensuring operational reliability at Extensiv. Collaborating with teams to design scalable systems on AWS.
Site Reliability Engineer responsible for architecting cloud infrastructure and containerized platforms at Vista Global. Implementing CI/CD pipelines and mentoring teams on best practices for production environments.
Site Reliability Engineer supporting Vista Global’s production environments and cloud infrastructure. Delivering solutions using AWS, Terraform, Ansible, Docker, and Kubernetes in a hybrid model.
Senior DevOps Engineer focused on network automation and cloud infrastructure at Tiger Analytics. Building scalable solutions for multiple Fortune 500 companies and ensuring high availability and performance.