Principal Engineer overseeing the architecture, operations, and performance tuning of Ceph storage clusters. Working with world-class engineers to drive innovation in private cloud storage at scale.
Responsibilities
Scale-Out Distributed Storage Architecture
Extensive experience in the design, architecture, and management of scale-out distributed storage systems in large production environments.
Demonstrated expertise in system performance tuning, data durability optimization (replication and/or erasure coding), and lifecycle management for petabyte-scale data deployments.
Proven ability to drive the evaluation, selection, and deployment of best-of-breed software-defined storage (SDS) solutions that meet demanding SLAs for latency, throughput, and availability.
Ceph Storage Architecture & Operations
Architect, deploy, and manage large-scale clusters across multiple production sites.
Ensure storage availability, data durability, and cluster resiliency through advanced CRUSH map configurations, erasure coding, and replication strategies.
Define upgrade strategy, cluster augmentation, node rebalancing, and hardware refreshes with minimal downtime.
Own end-to-end lifecycle management of storage clusters, including OS/Kernel tuning, firmware upgrades, and hardware integration.
Large Scale OpenStack Platform Experience
Deep (hands-on) architectural experience with the design, deployment, and management of large-scale OpenStack platforms in production environments.
Requirements
15–18 years of experience in scale-out distributed storage systems, infrastructure engineering, and Linux systems.
10+ years of hands-on experience with Ceph, including architecture, operations, and large-scale production support.
Proven experience managing clusters at petabyte scale with high performance and resiliency requirements.
Proficiency in Python and Shell scripting for automation and tooling.
Hands-on experience with configuration management (Ansible, Salt, Puppet) and IaC tools like Terraform.
Knowledge of container platforms (Docker, Kubernetes, LXC) and their storage backends (CSI, RBD).
Experience with monitoring and logging stacks (Prometheus, Grafana, ELK, OpenObserve).
Familiarity with cloud platforms (Azure, GCP, OpenStack, AWS) and hybrid cloud storage.
Benefits
Health benefits include medical, vision and dental coverage.
Financial benefits include 401(k), stock purchase and company-paid life insurance.
Paid time off benefits include PTO (including sick leave), parental leave, family care leave, bereavement, jury duty, and voting.
Other benefits include short-term and long-term disability, company discounts, Military Leave Pay, adoption and surrogacy expense reimbursement, and more.
Live Better U is a Walmart-paid education benefit program for full-time and part-time associates in Walmart and Sam's Club facilities. Programs range from high school completion to bachelor's degrees, including English Language Learning and short-form certificates. Tuition, books, and fees are completely paid for by Walmart. Eligibility requirements apply to some benefits and may depend on your job classification and length of employment. Benefits are subject to change and may be subject to a specific plan or program terms.