Senior Storage Production Engineer – DGX Cloud (Hybrid)

Posted last week

About the role

  • Develop and maintain large-scale storage systems for NVIDIA's GPU cloud services as a Production Engineer, with a focus on optimizing the performance, scalability, and reliability of the storage infrastructure.

Responsibilities

  • Design, implement, and support large-scale storage clusters, ensuring scalability, high availability, and data integrity.
  • Develop and maintain storage monitoring, logging, and alerting systems to ensure proactive detection and resolution of performance issues.
  • Work with AI/ML workloads to optimize storage architectures for low-latency access, efficient caching, and high-throughput performance.
  • Improve the lifecycle of storage services – from inception and design to deployment, operation, and continuous optimization.
  • Support storage services before they go live through activities such as system design consulting, developing automation frameworks, capacity management, and launch reviews.
  • Maintain production storage infrastructure by monitoring availability, latency, and system health, leveraging predictive analytics and AI-driven automation.
  • Optimize storage efficiency through compression, deduplication, tiering strategies, and intelligent workload placement.
  • Scale storage systems sustainably using AI/ML-driven automation, policy-based tiering, and dynamic data migration techniques.
  • Ensure data security and compliance by implementing encryption, access controls, and auditing mechanisms for storage systems.
  • Practice sustainable incident response and blameless root cause analysis.
  • Be part of an on-call rotation to support storage and production systems.

Requirements

  • BS degree in Computer Science, Storage Systems, or a related technical field (or equivalent experience), plus 8+ years of practical experience.
  • Experience with distributed and high-performance storage solutions, including clustered and parallel file systems, distributed object storage, and enterprise-grade storage systems.
  • Solid understanding of block, file, and object storage technologies, including their scalability, reliability, and performance characteristics, and of the standard processes for operating them.
  • Experience with storage access protocols and transports such as NFS, SMB, iSCSI, S3, Fibre Channel, RDMA, and NVMe over Fabrics.
  • Expertise in algorithms, data structures, complexity analysis, software design, and automating maintenance of large-scale Linux-based storage systems.
  • Experience in one or more of the following languages: C/C++, Java, Python, Go, Node.js, or Bash, applied to storage automation, monitoring, and performance tuning.
  • Hands-on experience with infrastructure configuration management tools like Ansible, Chef, Puppet, and Terraform for automating storage deployments.
  • Experience with observability and tracing tools like InfluxDB, Prometheus, Grafana, and the Elastic Stack for monitoring storage system health.

Benefits

  • equity
  • benefits

Job title

Senior Storage Production Engineer – DGX Cloud

Job type

Experience level

Senior

Salary

$168,000 - $270,250 per year

Degree requirement

Bachelor's Degree

Location requirements
