Senior Site Reliability Engineer ensuring reliability and scalability in Stellar's blockchain technology. Collaborating on AWS/GCP infrastructure and Kubernetes management.
Responsibilities
Maintain, improve, scale and secure our AWS/GCP infrastructure and Linux systems.
Assist our development teams in running, packaging, deploying and troubleshooting applications.
Work with developers on streamlining deployment processes with Jenkins and other CI/CD tooling.
Build, maintain, monitor and improve our Kubernetes clusters.
Work with development teams on migrating applications to Kubernetes.
Maintain and improve multiple internal services, such as Kubernetes, Prometheus and ELK.
Monitor, triage and respond to alerts in our high availability environments.
Participate in design and code reviews, and ensure that the foundation for our services is best in class.
Evaluate new technologies, then design and implement them as appropriate.
Identify automation opportunities and implement them by building custom solutions or using off-the-shelf ones.
Requirements
5+ years of experience working in cloud-based systems operations as an SRE or DevOps engineer.
First-hand experience with configuration management and infrastructure as code (Ansible, Puppet, Terraform).
Proficient in utilizing SRE methodologies like capacity planning and disaster recovery testing to ensure the scalability, resilience, and availability of critical services.
A strong understanding of computer networking, TCP/UDP, load balancing, distributed computing, web services, and the fundamental protocols used by the internet (HTTP, HTTPS, DNS, etc.).
Experienced in managing production workloads and skilled in using monitoring tools to detect issues early.
Comfortable with participating in on-call rotations and conducting thorough root cause analyses to keep systems running smoothly.
Proficiency in at least one programming language.
Committed to supporting teammates, especially during challenging times, and excited about working in a close-knit, growing team. Approachable, empathetic, and proactive in promoting collaboration and innovation.
Excels in working independently, demonstrating the ability to accomplish tasks without constant monitoring.
Production experience building and maintaining Kubernetes clusters.
Benefits
Competitive health, dental & vision coverage with most plans covered at 100% for the employee + any dependents
Flexible time off + 15 company holidays including a company-wide holiday break
Up to 12 weeks of paid parental leave for both non-birthing and birthing parents, as well as up to 14 weeks of paid pregnancy leave for birthing parents
Gym reimbursement ($80 per month)
Life & AD&D insurance (up to $50K)
Short- & long-term disability
401K with 4% match
Health & Dependent Care FSAs
Commuter benefits with $250/month employer contribution
Health Savings Account (HSA) with monthly employer contribution