Hybrid Associate Principal, Site Reliability Engineering

Posted 2 months ago


About the role

  • The Associate Principal in Site Reliability Engineering enhances the availability and performance of OCC’s Ovation platform, collaborates on automation, and mentors junior team members in cloud technologies.

Responsibilities

  • Provide strong support for the availability and performance of OCC’s next generation Ovation platform.
  • Enhance system reliability and developer productivity through automation.
  • Provide guidance to development and platform teams in the areas of cloud technologies, application profiling and monitoring, logging, and metrics collection and analysis.
  • Collaborate with development, operations and infrastructure teams to ensure availability of services, and to work through implementation issues
  • Develop automation for incident response and to prevent problem recurrence
  • Create and enhance runbooks to respond to service outages or degradations
  • Assess the production readiness of services
  • Define and track operational metrics for production performance, reliability, scalability and availability
  • Architect, develop and maintain shared services and tools to improve reliability and reduce toil across the organization
  • Contribute to the team’s continuous improvement through research, retrospectives, discussion groups and code reviews
  • Influence timelines and expectations among the team
  • Provide knowledge by guiding and mentoring junior members, and preparing stories for the sprint backlog

Requirements

  • [Required] Experience with maintaining and troubleshooting large-scale distributed systems
  • [Required] Experience with Agile / Scrum methodology
  • [Required] Able to succeed in fast-paced environment with frequent changes
  • [Required] Comfortable communicating with both technical and non-technical audiences
  • [Required] Strong documentation skills
  • [Required] Analytical problem-solving approach
  • [Required] Self-starter – takes the initiative to research, learn and deliver. Anticipates the play
  • [Required] Team player – humble, collaborative, and focused on making sure the entire team succeeds
  • [Required] Experience managing infrastructure in public cloud environments like AWS (preferred), Azure or GCP
  • [Required] Experience with AIOps and predictive analysis for anomaly detection, forecasting system capacity using monitoring and alerting tools like Splunk, AppDynamics, Datadog, StackDriver, Sysdig, Prometheus or Grafana
  • [Required] Programming/scripting experience in languages like Java, Bash, Python or Go
  • [Required] Experience with distributed messaging systems like Kafka, RabbitMQ, or ActiveMQ
  • [Required] Experience with container orchestration systems like Kubernetes, Mesos, Docker Swarm or Rancher
  • [Required] Experience with using Continuous Integration and Continuous Delivery (CI/CD) tools like Jenkins, Travis, Harness, Appveyor, CodeBuild or CodePipeline
  • [Required] Familiarity with leveraging large language models (LLMs) to automate and optimize SRE workflows. This may include using AI-powered tools to perform tasks such as writing scripts, summarizing incident reports, or even creating and maintaining AI workloads.
  • [Required] Basic exposure to Chaos Engineering tools like Gremlin, Chaos Monkey, or Harness Chaos Engineering, or to cloud-native fault injection services like AWS FIS.
  • [Required] Bachelor’s or Master’s degree in Computer Science, Information Systems or another related field, or equivalent work experience
  • [Required] Minimum of 4 years of experience in Site Reliability Engineering / DevOps

Benefits

  • A hybrid work environment, up to 2 days per week of remote work
  • Tuition Reimbursement to support your continued education
  • Student Loan Repayment Assistance
  • Technology Stipend allowing you to use the device of your choice to connect to our network while working remotely
  • Generous PTO and Parental leave
  • 401k Employer Match
  • Competitive health benefits including medical, dental and vision

Job title

Associate Principal, Site Reliability Engineering

Job type

Experience level

Junior, Mid level

Salary

$118,300 - $192,400 per year

Degree requirement

Bachelor's Degree

Location requirements
