Hybrid Staff Site Reliability Engineer

Posted 6 days ago

About the role

  • As a Staff Site Reliability Engineer at Achievers, you will manage GCP/GKE infrastructure and AI-driven workflows, leading initiatives to build reliable, scalable cloud systems and strengthen infrastructure resilience.

Responsibilities

  • Lead high-impact initiatives that shape how millions of people experience work around the world.
  • Bring your unique perspective to complex and challenging projects: apply your expertise in architecture, influence technical direction, and mentor fellow team members.
  • Join a close-knit, no-ego, high-performing team that solves meaningful problems and celebrates successes together.
  • Work alongside an experienced leadership team that is genuinely invested in your career growth.
  • Thrive in a fast-paced, high-growth environment where innovation is encouraged and your voice truly matters.
  • Lead the design and ongoing evolution of our global, high-availability infrastructure, focusing on Google Cloud Platform (GCP) and Kubernetes (GKE).
  • Identify repetitive operational tasks and implement AI-integrated workflows, such as Slack or Teams bots for incident triage, AI-augmented alerting, and automated PR generation to address infrastructure drift.
  • Collaborate with Product, Engineering, and Leadership teams to identify systemic risks, manage complex changes, and define the long-term reliability roadmap.
  • Establish and exemplify best practices for Terraform and CI/CD pipelines, empowering development teams to deploy code rapidly and securely.
  • Lead high-level initiatives in disaster recovery, multi-region networking, and the design of zero-trust security architectures.
  • Guide design reviews and promote best practices, enhancing the technical skills and capabilities of the entire SRE organization.

Requirements

  • Possess extensive systems engineering experience, with in-depth knowledge of the Linux kernel, network protocols (TCP/IP, BGP, DNS), and cloud-native architecture.
  • Demonstrated, hands-on experience in architecting and managing production workloads on Google Cloud Platform and GKE.
  • Practical experience or a strong vision for integrating AI tools and LLMs to automate SRE tasks, documentation, or incident response.
  • Advanced skills in Python or Go, with the ability to develop sophisticated internal tools and automation frameworks.
  • Expert understanding of observability frameworks (such as New Relic, Prometheus, Grafana) to enable data-driven decision-making.
  • Deep knowledge of managing relational and document databases (MySQL, MongoDB) at scale.
  • Exceptional ability to clearly convey complex technical infrastructure challenges as actionable business insights to non-technical stakeholders.
  • Set industry trends by applying emerging technologies like AI to address longstanding infrastructure challenges.
  • Maintain a mindset of continuous improvement, always seeking opportunities to automate processes.
  • Believe that platform reliability is fundamental to both employee success and customer trust.

Benefits

  • Rewards for your impact through our Recognition and Rewards program
  • Health Benefits and Life Insurance Coverage beginning on your first day
  • Parental Leave Top-up
  • Employer-matched RRSP contributions
  • Flexible Vacation to recharge, so you can bring your best
  • Employee and Family Assistance Program offering mental health, legal, and financial counselling
  • Supported professional development and career growth (LinkedIn Learning, mentorship)
  • Employee-Led Employee Resource Groups that celebrate our diversity
  • Regular events designed to build connection, belonging, and well-being
  • Hybrid flexibility, with time in our beautiful Liberty Village, Toronto office

Job title

Staff Site Reliability Engineer

Job type

Experience level

Lead

Salary

$124,000 - $170,000 per year

Degree requirement

Bachelor's Degree

Location requirements
