Hybrid Senior/Staff Site Reliability Engineer, Platform Engineering

Posted last week

About the role

  • Staff Platform Engineer at Saviynt, responsible for keeping complex cloud-native systems highly available and secure, and for driving automation and reliability improvements across multiple teams on the company's SaaS platform.

Responsibilities

  • Play a critical role in ensuring complex, distributed, cloud-native systems remain highly available, scalable, and secure
  • Own reliability for major platform domains and design scalable solutions on Kubernetes and AWS
  • Drive automation and reliability improvements across multiple teams
  • Play an instrumental role in designing, building, and maintaining shared infrastructure services and platforms for product and application teams
  • Create reusable, reliable, and scalable solutions that abstract away complexity
  • Design and build core platform components and shared infrastructure services for deployment and operation of applications
  • Architect, implement, and manage highly available and scalable Kubernetes platforms for internal consumers
  • Develop internal-facing tools and automation for infrastructure provisioning and management using Go (Golang)
  • Architect and optimize foundational solutions within Cloud environments like AWS and Azure
  • Design and implement shared Event-Driven Architecture components and messaging platforms using technologies like Kafka or Google Cloud Pub/Sub
  • Develop and maintain CI/CD pipelines (e.g., GitLab CI and ArgoCD) for standardized deployment workflows
  • Design and build resilient Distributed Systems components focusing on reliability, fault tolerance, and performance
  • Manage and optimize shared infrastructure across Multi-Region Cloud Environments
  • Establish and enhance centralized Observability and Monitoring platforms for insights
  • Define and implement clear, well-documented RESTful API designs for internal clients
  • Implement and manage Service Mesh capabilities for traffic management and security
  • Design, implement, and optimize highly available Relational Database services
  • Collaborate closely with product development teams for infrastructure needs
  • Participate in on-call rotations to support critical shared infrastructure

Requirements

  • 6+ years of experience in an Infrastructure Development, Platform Engineering, or Site Reliability Engineering role, with a strong focus on building tools and services for other engineers
  • Deep expertise with Kubernetes in production environments, particularly in providing it as a platform (i.e., single-tenant and multi-tenant deployment architectures)
  • Strong programming skills in Go (Golang) and Python, with experience building robust, maintainable backend services and automation
  • Extensive hands-on experience with at least one major Cloud Provider (AWS, GCP, or Azure); multi-cloud experience is a strong plus, especially in building abstractions over them
  • Proven experience designing and implementing Event-Driven Architecture and message queuing systems (e.g., Kafka, RabbitMQ, NATS) as shared services
  • Solid understanding and practical experience with CI/CD pipeline tools (especially GitLab CI) and experience establishing automated delivery processes for other teams
  • Demonstrable experience designing and operating Distributed Systems, with an understanding of patterns for creating reliable, shared components
  • Familiarity with Multi-Region Cloud Environments and strategies for building globally distributed and highly available platforms
  • Proficiency in establishing and utilizing comprehensive Observability and Monitoring platforms (e.g., Prometheus, Grafana, ELK stack, Datadog) for shared infrastructure
  • Strong experience with RESTful API design principles and building well-documented, consumable APIs
  • Knowledge of Service Mesh concepts and practical experience with solutions like Istio in a platform context
  • Hands-on experience with Relational Databases (e.g., MySQL, PostgreSQL), ideally in managing them as a service
  • Excellent communication skills and the ability to clearly articulate complex technical concepts to both technical and non-technical audiences
  • A strong customer-centric mindset, treating internal development teams as your primary customers
  • Bachelor's degree in Computer Science, Engineering, or a related field, or equivalent practical or military experience, is required

Benefits

  • Work on a large-scale, cloud-native SaaS platform
  • Solve complex reliability challenges at scale
  • Influence platform architecture and engineering practices
  • Competitive compensation, benefits, and career growth

Job title

Senior/Staff Site Reliability Engineer, Platform Engineering

Job type

Experience level

Senior

Salary

Not specified

Degree requirement

Bachelor's Degree

Location requirements
