Staff Platform Engineer ensuring complex cloud-native systems remain highly available and secure at Saviynt. Driving automation and reliability improvements across multiple teams for its SaaS platform.
Responsibilities
Play a critical role in ensuring complex, distributed, cloud-native systems remain highly available, scalable, and secure
Own reliability for major platform domains and design scalable solutions on Kubernetes and AWS
Drive automation and reliability improvements across multiple teams
Be instrumental in designing, building, and maintaining shared infrastructure services and platforms for product and application teams
Create reusable, reliable, and scalable solutions that abstract away complexity
Design and build core platform components and shared infrastructure services for deployment and operation of applications
Architect, implement, and manage highly available and scalable Kubernetes platforms for internal consumers
Develop internal-facing tools and automation for infrastructure provisioning and management using Go (Golang)
Architect and optimize foundational solutions within Cloud environments like AWS and Azure
Design and implement shared Event-Driven Architecture components and messaging platforms using technologies like Kafka or Google Pub/Sub
Develop and maintain CI/CD pipelines (e.g., GitLab CI and ArgoCD) for standardized deployment workflows
Design and build resilient Distributed Systems components focusing on reliability, fault tolerance, and performance
Manage and optimize shared infrastructure across Multi-Region Cloud Environments
Establish and enhance centralized Observability and Monitoring platforms for insights
Define and implement clear, well-documented RESTful API designs for internal clients
Implement and manage Service Mesh capabilities for traffic management and security
Design, implement, and optimize highly available Relational Database services
Collaborate closely with product development teams for infrastructure needs
Participate in on-call rotations to support critical shared infrastructure
Requirements
6+ years of experience in an Infrastructure Development, Platform Engineering, or Site Reliability Engineering role, with a strong focus on building tools and services for other engineers
Deep expertise with Kubernetes in production environments, particularly in providing it as a platform (i.e., single-tenant and multi-tenant deployment architectures)
Strong programming skills in Go (Golang) and Python, with experience building robust, maintainable backend services and automation
Extensive hands-on experience with at least one major Cloud Provider (AWS, GCP, or Azure); multi-cloud experience is a strong plus, especially in building abstractions over them
Proven experience designing and implementing Event-Driven Architecture and message queuing systems (e.g., Kafka, RabbitMQ, NATS) as shared services
Solid understanding and practical experience with CI/CD pipeline tools (especially GitLab CI) and experience establishing automated delivery processes for other teams
Demonstrable experience designing and operating Distributed Systems, with an understanding of patterns for creating reliable, shared components
Familiarity with Multi-Region Cloud Environments and strategies for building globally distributed and highly available platforms
Proficiency in establishing and utilizing comprehensive Observability and Monitoring platforms (e.g., Prometheus, Grafana, ELK stack, Datadog) for shared infrastructure
Strong experience with RESTful API design principles and building well-documented, consumable APIs
Knowledge of Service Mesh concepts and practical experience with solutions like Istio in a platform context
Hands-on experience with Relational Databases (e.g., MySQL, PostgreSQL), ideally in managing them as a service
Excellent communication skills and the ability to clearly articulate complex technical concepts to both technical and non-technical audiences
A strong customer-centric mindset, treating internal development teams as your primary customers
Bachelor's degree in Computer Science, Engineering, or a related field (or equivalent practical or military experience) required
Benefits
Work on a large-scale, cloud-native SaaS platform
Solve complex reliability challenges at scale
Influence platform architecture and engineering practices
Competitive compensation, benefits, and career growth
Job title
Senior/Staff Site Reliability Engineer, Platform Engineering