Principal Site Reliability Engineer at HPE optimizing cloud infrastructure and deployment systems. Key responsibilities include enhancing Infrastructure as Code (IaC), improving CI/CD pipelines, and ensuring system reliability.
Responsibilities
Enhance Infrastructure as Code (IaC) and enforce best practices.
Optimize cloud infrastructure for scalability, security, and cost-effectiveness.
Develop internal tools to support and streamline cloud platform operations.
Improve CI/CD pipelines and deployment workflows using FluxCD and Jenkins.
Address container image vulnerabilities and standardize remediation processes.
Build Amazon Machine Images (AMIs) aligned with CIS and STIG benchmarks.
Strengthen monitoring, alerting, and observability using Prometheus, Grafana, and logging tools.
Troubleshoot complex production issues to ensure system reliability and customer satisfaction.
Fine-tune distributed systems such as Apache Kafka and Cassandra.
Collaborate with development, security, and operations teams to align infrastructure with application needs.
Requirements
Minimum of 10 years of hands-on experience in InfraOps, DevOps, or Site Reliability Engineering (SRE).
Proficiency with Linux systems, especially Debian-based distributions.
Strong experience with cloud platforms such as AWS and GCP.
Expertise in Infrastructure as Code tools like Terraform, Packer, and Ansible.
Solid programming skills in Python and/or Golang.
Deep understanding of containerization (Docker, containerd) and orchestration tools (AWS EKS, GCP GKE).
Experience with GitOps workflows.
Proven track record in implementing and maintaining CI/CD pipelines.
Strong background in security and familiarity with security programs.
Experience with monitoring and logging tools (Prometheus, Grafana, ELK).
Knowledge of both relational (SQL) and non-relational databases.
Excellent problem-solving and debugging skills with a strong sense of ownership.
Experience managing distributed systems like Apache Kafka and Cassandra.
Effective communicator and collaborative team player.
Benefits
Health & Wellbeing: We strive to provide our team members and their loved ones with a comprehensive suite of benefits that supports their physical, financial and emotional wellbeing.
Personal & Professional Development: We also invest in your career because the better you are, the better we all are. We have specific programs catered to helping you reach any career goals you have, whether you want to become a knowledge expert in your field or apply your skills to another division.
Unconditional Inclusion: We are unconditionally inclusive in the way we work and celebrate individual uniqueness. We know varied backgrounds are valued and succeed here. We have the flexibility to manage our work and personal needs. We make bold moves, together, and are a force for good.
SRE responsible for ensuring reliability and performance of IT systems at a digital transformation company specializing in public sector efficiency. Collaborating on system health, incident response, and automation tasks.
DevOps Senior role at Beyond Soluções managing CI/CD for .NET and Kubernetes applications. Collaborating on cloud solutions while fostering a culture of innovation and quality.
Senior Software Engineer at PayPal managing cloud infrastructure and DevOps solutions. Delivering complete SDLC solutions and guiding engineering teams for scalable and reliable services.
Senior Site Reliability Engineer at Diligent leading reliability, automation, and observability across cloud infrastructure. Building tools for incident response and enhancing performance in fast-paced environments.
Perception Deployment Engineer deploying deep learning models on embedded systems at Caterpillar. Collaborating with cross-functional teams for integration and optimization of perception modules in vehicles.
Principal Site Reliability Engineer at AT&T required to design scalable solutions for critical operations with minimal downtime. Collaborating with teams to monitor and improve system performance in cloud environments.
DevOps Engineer managing AI SaaS infrastructure at a high-growth European company. Supporting AI model deployment and ensuring platform security and compliance with multiple systems integration.
Engineering Manager leading teams for observability platforms at LexisNexis. Owns operational excellence across software delivery lifecycle in Raleigh, NC.
Reliability Engineer optimizing site facility infrastructure and utility systems at Roche. Conducting root cause analyses and developing maintenance plans to enhance reliability and efficiency.
DevOps SME designing, implementing, and operating multi-cloud platforms for The Missing Link. Collaborating with engineering, security, and operations teams while embedding DevOps best practices.