About the role

  • Site Reliability Engineer at HPE responsible for the high availability and performance of cloud infrastructure across AWS and GCP environments. The role covers managing incidents, monitoring systems, and supporting multi-cloud production environments.

Responsibilities

  • Ensure high availability, reliability, and performance of large-scale cloud infrastructure across AWS and GCP environments.
  • Operate and support infrastructure components and distributed data platforms such as Kubernetes, Kafka, Flink, Storm, and Spark.
  • Manage and maintain databases including Cassandra, Elasticsearch, Redis, Postgres, and ArangoDB.
  • Monitor systems, troubleshoot issues, and resolve production incidents across microservices and distributed systems.
  • Collaborate closely with software engineering teams to debug and resolve complex production problems.
  • Participate in 24x7 on-call rotation supporting multi-cloud production environments.
  • Monitor system metrics, application performance, and infrastructure health using observability tools.
  • Own the incident management lifecycle, including detection, mitigation, Root Cause Analysis (RCA), and post-incident reviews.
  • Develop and maintain runbooks, automation, and operational processes to improve reliability and efficiency.
  • Perform capacity planning using system usage and performance data.
  • Drive SRE best practices, operational standards, and continuous improvement initiatives.

Requirements

  • Bachelor’s or Master’s degree in Computer Science, Information Systems, or a related field.
  • 6–10+ years of experience in DevOps, Site Reliability Engineering, or cloud infrastructure roles.
  • Strong hands-on experience with cloud platforms (AWS or GCP) including services like EC2/GCE, IAM, and object storage (S3/GCS).
  • Experience with containerization and orchestration technologies, especially Docker and Kubernetes.
  • Experience building and managing CI/CD pipelines using tools such as Jenkins, GitHub Actions, or GitLab.
  • Experience with monitoring and observability tools such as Prometheus, CloudWatch, or Stackdriver.
  • Strong understanding of Linux systems administration and configuration management tools like Ansible.
  • Experience managing distributed systems and streaming platforms such as Kafka, Cassandra, Elasticsearch, Spark, Flink, or Storm.
  • Strong automation and scripting skills using Python, Go, Rust, or Shell scripting.
  • Experience with Infrastructure as Code (IaC) tools like Terraform or CloudFormation.
  • Excellent analytical, troubleshooting, and problem-solving skills.
  • Strong communication and collaboration skills with the ability to work with cross-functional teams.

Benefits

  • Health & Wellbeing
  • Personal & Professional Development
  • Unconditional Inclusion

Job title

Senior Site Reliability Engineer, SRE

Experience level

Senior

Salary

Not specified

Degree requirement

Bachelor's Degree
