Hybrid Lead Site Reliability Engineer

Posted last month

About the role

  • Implement monitoring and alerting systems to guarantee high availability and performance, focused on SLA and availability metrics
  • Collaborate with engineering and operations teams to identify critical components requiring enhanced availability measures
  • Design and implement strategies, tooling, and processes to enhance system uptime and reliability
  • Continuously evaluate and recommend improvements to platform infrastructure and processes
  • Align the platform with customer needs and business goals by working closely with cross-functional teams
  • Run the production environment by monitoring availability and taking a holistic view of system health
  • Build software and systems to monitor platform infrastructure and applications
  • Monitor and improve reliability, quality, and time-to-market of our suite of software solutions
  • Measure and optimize system performance, get ahead of customer needs, and innovate for continual improvement
  • Provide primary operational support and engineering for multiple large-scale distributed software applications
  • Gather and analyze metrics from operating systems and applications to assist in performance tuning and fault finding

Requirements

  • Bachelor’s degree or higher in a technology-related field (e.g. Engineering, Computer Science) required; Master’s degree a plus
  • 6+ years of professional experience in monitoring and alerting roles on major cloud platforms (AWS, Azure)
  • 4+ years of experience in cloud development (AWS, Azure), with observability skills
  • 3+ years of experience in software development with Python, NodeJS, or Java with a focus on SDLC and automation
  • Hands-on experience with container orchestration, preferably with Kubernetes
  • Hands-on experience building observability, monitoring, and alerting for large-scale distributed systems
  • Leadership/design of application and/or infrastructure migration projects from on-prem to cloud
  • Cloud architecture design and implementation experience
  • Familiarity with current AWS solutions; Azure experience also considered
  • Experience with containerized workloads (Helm, AKS, EKS, Docker, JFrog)
  • Experience with logging and monitoring tools (Prometheus, Grafana, Datadog, AWS CloudWatch, Azure Monitor, Log Analytics, Fluentd, ELK/OpenSearch, OpenTelemetry)
  • Network Security knowledge (IAM/Policy, Azure Policy, VPN, Active Directory/RBAC, ACLs, NSG rules, private endpoints)
  • Proven experience implementing advanced observability practices and techniques at scale
  • Ability to automate alert resolution using scripting languages (Python, Golang, Shell)
  • Knowledge of managing systems using infrastructure-as-code tools (Terraform, ARM, Chef)
  • Solid understanding of Cloud Computing and DevOps concepts
  • Proven experience in maintaining scalability and resiliency of complex environments
  • Ability to triage, execute root cause analysis, and be decisive under pressure
  • Experience managing and interpreting large datasets using query languages and visualization tools
  • Strong communication skills and ability to work with diverse teams

Job title

Lead Site Reliability Engineer

Experience level

Senior

Salary

Not specified

Degree requirement

Bachelor's Degree
