Production Engineer developing and maintaining large-scale storage systems for NVIDIA's GPU cloud services. Focused on optimizing the performance, scalability, and reliability of storage infrastructure.
Responsibilities
Design, implement, and support large-scale storage clusters, ensuring scalability, high availability, and data integrity.
Develop and maintain storage monitoring, logging, and alerting systems to ensure proactive detection and resolution of performance issues.
Work with AI/ML workloads to optimize storage architectures for low-latency access, efficient caching, and high-throughput performance.
Improve the lifecycle of storage services – from inception and design to deployment, operation, and continuous optimization.
Support storage services before they go live through activities such as system design consulting, developing automation frameworks, capacity management, and launch reviews.
Maintain production storage infrastructure by monitoring availability, latency, and system health, leveraging predictive analytics and AI-driven automation.
Optimize storage efficiency through compression, deduplication, tiering strategies, and intelligent workload placement.
Scale storage systems sustainably using AI/ML-driven automation, policy-based tiering, and dynamic data migration techniques.
Ensure data security and compliance by implementing encryption, access controls, and auditing mechanisms for storage systems.
Practice sustainable incident response and blameless root cause analysis.
Be part of an on-call rotation to support storage and production systems.
Requirements
BS degree in Computer Science, Storage Systems, or a related technical field (or equivalent experience), with 8+ years of practical experience.
Experience with distributed and high-performance storage solutions, including clustered and parallel file systems, distributed object storage, and enterprise-grade storage systems.
Solid understanding of block, file, and object storage technologies, including their scalability, reliability, and performance characteristics and the standard practices for operating them.
Experience with storage networking protocols such as NFS, SMB, iSCSI, S3, Fibre Channel, RDMA, and NVMe over Fabrics.
Expertise in algorithms, data structures, complexity analysis, software design, and automating maintenance of large-scale Linux-based storage systems.
Experience in one or more of the following: C/C++, Java, Python, Go, Node.js, or Bash for storage automation, monitoring, and performance tuning.
Hands-on experience with configuration management and infrastructure-as-code tools such as Ansible, Chef, Puppet, and Terraform for automating storage deployments.
Experience with observability tools such as InfluxDB, Prometheus, Grafana, and the Elastic Stack for monitoring storage system health.
Senior Production Engineer planning feasibility studies and ensuring technical prerequisites in production. Collaborating in product creation and improvement processes within a global clean energy company.
Production Engineer supporting assembly operations in high-mix, low-volume environments. Providing hands-on technical support to resolve production issues and improve assembly methods while collaborating with shop floor personnel.
Production Engineering Manager at Boeing leading teams for high-precision automation challenges and driving production asset reliability. Ensuring functionality, reliability, and compliance at the Auburn and Frederickson sites.
Production Engineering Manager at Fervo Energy optimizing geothermal asset performance. Leading engineering activities and establishing performance standards for operational efficiency and safety.
Systems / Backend / Production Engineer role seeking technical talent for critical environments and automation in Colombia. Requires B2-level English and technical skills in C++, SQL, and more.
Refinery Production Engineer driving operational performance and process management in manufacturing plants. Collaborating with multiple functions for production quality and efficiency improvements.
Systems Engineer at Klee Group responsible for operating client platforms. Working in a heterogeneous technical environment with experienced teams.
Production Engineer controlling processes and analyzing losses at SONDA. Supporting logistics operations and developing systems in a collaborative work environment.
Java Developer elevating technical leadership and engineering capabilities in Microservices for ANZ. Engaging in development, testing, and solution design with a collaborative team approach.
Production Engineer responsible for optimizing manufacturing processes, ensuring product quality, and driving operational improvements in a manufacturing setting. Collaborating with product development teams and managing engineering projects.