About the role

  • As a Senior Site Reliability Engineer, you will manage AWS infrastructure for Tealium's customer data platform and collaborate globally to improve SaaS reliability and performance in a hybrid/remote setup.

Responsibilities

  • Participate in a rotating 1st- and 2nd-level on-call schedule for approximately 20% of working time.
  • Improve the alert signal-to-noise ratio by focusing on services and business metrics over individual hosts.
  • Proactively research potential problems before they arise.
  • Continuously identify system bottlenecks, design and implement automation, and optimize infrastructure to improve reliability and cost efficiency.
  • Work closely with managers and the team on opportunities for monitoring, alerting, and trending improvements, as well as runbook and troubleshooting documentation.
  • Look for opportunities to improve application logging and exception handling.
  • Maintain and expand strong cross-functional relationships with Customer Success and Product Management teams.
  • Manage incidents and implement changes following Tealium’s Agile/ITIL processes.

Requirements

  • Minimum of 5 years of experience with Linux-based AWS (>10 services).
  • Minimum of 3 years of experience in a technical management capacity within a 24x7x365 SaaS environment.
  • Strong time management, organizational, and oral and written communication skills required.
  • Ability to articulate technical challenges and proposed solutions in a succinct, clear manner for all organizational levels.
  • Experienced with sophisticated AWS services and systems running at scale, from millions to billions of transactions per day (EC2/ALB, EKS/Kubernetes, AWS CLI).
  • Experienced with software development and/or scripting, CI/CD, and infrastructure-as-code tools (Java, Go, Python, Jenkins, Git, Terraform, Jira/Confluence).
  • Experienced with databases and data warehousing, such as DynamoDB, Redshift, and Postgres.
  • Experienced with pub/sub streaming platforms such as Kafka/MSK.
  • Experienced with modern observability, logging, monitoring, alerting, trending, and dashboarding methods and tools (Datadog, Sumo Logic, Jira Service Desk, CloudWatch, Prometheus).
  • Strong attention to detail and aptitude for data-driven questioning.
  • Proven, pragmatic use of AI tools such as Kiro/Q, Claude, etc., to accelerate your work.
  • Lifelong learning and curiosity in solving software technical challenges.

Benefits

  • Employees are eligible to receive an annual bonus and stock options.
  • Employees and their families are eligible for medical, dental, vision, life, and disability insurance.
  • Employees have the option to enroll in our 401(k) plan and are eligible to receive company matching contributions.
  • Employees are eligible for flexible paid time-off and extended paid parental leave.
  • We offer 11 paid holidays annually.
  • We offer 15 hours of paid work time for volunteer activities and programs.
  • Sick leave accrues as follows:
      • Exempt CA employees (not including San Francisco), including NY: accrue 40 hours each year. Unused sick leave carries over into the next year. Employees cannot exceed 80 hours in a given year.
      • Exempt non-CA employees (not including NY), including SF: accrue 1 hour for every 30 hours worked. Cannot exceed 180 hours in the calendar year.
      • Non-exempt employees: accrue 1 hour for every 30 hours worked. Unused sick leave carries over to the next year. Not to exceed 108 hours in a calendar year.

Job title

Senior SRE

Experience level

Senior

Salary

$140,000 - $155,000 per year

Degree requirement

Bachelor's Degree
