Staff Software Engineer joining the Site Reliability team to ensure the performance and reliability of a legal AI platform. The role covers designing monitoring and alerting systems and managing operations across global regions.
Responsibilities
Design, implement, and manage monitoring, alerting, and infrastructure resources (compute, storage, networking) across 50+ global regions
Lead incident management processes, including postmortems, root cause analyses, and driving actionable improvements
Automate operational tasks and workflows, building tools and processes for capacity planning, graceful rollouts, and safe data access to maintain high reliability and reduce manual intervention
Establish best practices for security, compliance, and reliability and collaborate across teams to drive these principles throughout the software lifecycle
Optimize infrastructure costs through strategic capacity planning and build-versus-buy decisions while maintaining system performance, reliability, and functionality
Provide technical mentorship and leadership, promoting best practices and fostering team growth
Requirements
10+ years of experience in Site Reliability Engineering or similar roles supporting production environments, with proven ability to mentor and guide technical teams
Expertise in infrastructure-as-code (IaC) tools (Pulumi, Terraform, CloudFormation, etc.)
Deep familiarity with observability tools (Datadog, Sentry, etc.) and incident response practices (PagerDuty, IncidentIO, etc.)
Proficiency with cloud infrastructure platforms (Azure, GCP, AWS, etc.)
Strong programming skills (Python, Bash, Go, or similar languages)
Proven track record of diagnosing complex system problems and implementing durable solutions
Solid understanding of CI/CD, Kubernetes, containerization, networking, databases, and cloud security principles
Excellent problem-solving skills, meticulous attention to detail, and a commitment to operational excellence
Work eligibility: Must be authorized to work in India. Visa sponsorship is not available for this role.