About the role

  • As a Senior Site Reliability Engineer at Stellar Development Foundation, you will ensure system reliability and scalability, collaborating on cloud infrastructure, Kubernetes, and operational excellence for blockchain technology.

Responsibilities

  • Maintain, improve, scale and secure our AWS/GCP infrastructure and Linux systems.
  • Assist our development teams in running, packaging, deploying and troubleshooting applications.
  • Work with developers on streamlining deployment processes with Jenkins and other CI/CD tooling.
  • Build, maintain, monitor and improve our Kubernetes clusters.
  • Work with development teams on migrating applications to Kubernetes.
  • Be responsible for maintaining and improving multiple internal services, such as Kubernetes, Prometheus, and ELK.
  • Monitor, triage and respond to alerts in our high availability environments.
  • Participate in design and code reviews, and ensure that the foundation for our services is best in class.
  • Evaluate new technologies, then design and implement them as appropriate.
  • Identify automation opportunities and implement them by building custom solutions or using off-the-shelf ones.

Requirements

  • 5+ years of experience working in cloud-based systems operations as an SRE or DevOps engineer.
  • First-hand experience with configuration management and infrastructure as code (Ansible, Puppet, Terraform).
  • Proficient in utilizing SRE methodologies like capacity planning and disaster recovery testing to ensure the scalability, resilience, and availability of critical services.
  • A strong understanding of computer networking, TCP/UDP, load balancing, distributed computing, web services, and the fundamental protocols used by the internet (HTTP, HTTPS, DNS, etc.).
  • Experienced in managing production workloads and skilled in using monitoring tools to detect issues early.
  • Comfortable with participating in on-call rotations and conducting thorough root cause analyses to keep systems running smoothly.
  • Proficiency in at least one programming language.
  • Committed to supporting teammates, especially during challenging times, and excited about working in a close-knit, growing team. Approachable, empathetic, and proactive in promoting collaboration and innovation.
  • Excels in working independently, demonstrating the ability to accomplish tasks without constant monitoring.
  • Production experience building and maintaining Kubernetes clusters.

Benefits

  • Competitive health, dental & vision coverage with most plans covered at 100% for the employee + any dependents
  • Flexible time off + 15 company holidays including a company-wide holiday break
  • Up to 12 weeks of paid parental leave for both non-birthing and birthing parents, as well as up to 14 weeks of paid pregnancy leave for birthing parents
  • Gym reimbursement ($80 per month)
  • Life & AD&D insurance (up to $50K)
  • Short- and long-term disability
  • 401(k) with 4% match
  • Health & Dependent Care FSAs
  • Commuter benefits with $250/month employer contribution
  • Health Savings Account (HSA) with monthly employer contribution
  • Family building benefits through Kindbody
  • Wellbeing benefits (One Medical, Rightway, Headspace)
  • L&D budget of $1,500/year
  • Daily lunch and snacks in office
  • Company retreats

Job title

Senior Site Reliability Engineer

Experience level

Senior

Salary

$165,000 - $225,000 per year

Degree requirement

No Education Requirement
