Runpod, Inc.

Manager, HPC Storage Engineer at Runpod, Inc.

Remote, USA · Full-time · Remote · Engineering · Posted 15 days ago

About the Role

<p>Runpod is pioneering the future of AI and machine learning, offering cutting-edge cloud infrastructure for full‑stack AI applications. Founded in 2022, we are a rapidly growing, well‑funded, remote‑first company with a global team across the US, Canada, and Europe. Our mission is to create a foundational platform that enables developers and companies to build, deploy, and scale custom AI systems with speed and flexibility.</p> <p>As AI workloads continue to push the limits of throughput, latency, and parallelism, Runpod is investing heavily in next-generation storage architectures purpose-built for GPU-centric compute.</p> <p>We are looking for an Engineering Manager, Datacenter Storage Engineering to lead the team responsible for Runpod’s distributed storage infrastructure across all regions. This role owns the end-to-end storage stack — from NAND and NVMe devices through filesystems, transport protocols, and cluster-level deployment — ensuring performance, reliability, and scalability for AI workloads.</p> <p>You will manage engineers designing and operating large-scale SAN and NFS-based systems, including high-performance shared filesystems for training workloads. This role requires deep technical fluency and architectural leadership, combined with strong people management and operational discipline.</p> <h3><strong>Responsibilities</strong></h3> <ul> <li><strong>Own Distributed Storage Architecture:</strong> Define, evolve, and operate Runpod’s global storage platforms, supporting training, inference, checkpointing, and dataset access at scale.</li> <li><strong>Build the Storage Engineering Team:</strong> Manage and grow a team of storage and systems engineers. 
Set clear ownership, technical direction, and operational standards across regions.</li> <li><strong>High-Performance Shared Filesystems:</strong> Design and operate large-scale <strong>SAN and NFS deployments</strong>, including performance-sensitive shared storage for GPU clusters.</li> <li><strong>Advanced Filesystems &amp; Platforms:</strong> Lead deployments and operations of <strong>VAST Data</strong> and of <strong>Lustre or similar parallel filesystems</strong> used in HPC and AI environments.</li> <li><strong>End-to-End Performance Ownership:</strong> Drive performance optimization from <strong>NAND and NVMe media</strong> through controllers, networking, and client access patterns.</li> <li><strong>Next-Generation Storage Technologies:</strong> Evaluate and deploy cutting-edge capabilities such as <strong>NFS over RDMA, GPUDirect Storage (GDS)</strong>, and low-latency data paths for accelerated workloads.</li> <li><strong>Reliability &amp; Scale:</strong> Establish best practices for replication, data tiering, data protection, failure recovery, capacity planning, and lifecycle management.</li> <li><strong>Automation &amp; Observability:</strong> Build automation for provisioning, expansion, upgrades, and monitoring. 
Ensure deep observability into throughput, latency, and error characteristics.</li> <li><strong>Cross-Functional Collaboration:</strong> Partner with Datacenter Networking, GPU Platform, SRE, and Product teams to ensure storage systems meet evolving workload and customer needs.</li> <li><strong>Vendor &amp; Partner Management:</strong> Own technical relationships with storage vendors, hardware partners, and colocation providers; drive roadmap alignment and issue resolution.</li> </ul> <h3><strong>Requirements</strong></h3> <ul> <li><strong>Engineering Leadership Experience:</strong> 3+ years managing storage, systems, or infrastructure engineering teams in production environments.</li> <li><strong>Distributed Storage Expertise:</strong> 8+ years designing and operating large-scale storage systems, including <strong>SAN and NFS architectures</strong> at multi-petabyte scale.</li> <li><strong>VAST Data Experience:</strong> Hands-on experience deploying, operating, or deeply integrating <strong>VAST Data</strong> in production environments is required.</li> <li><strong>Parallel Filesystems:</strong> Experience with <strong>Lustre or comparable HPC filesystems</strong> (e.g., GPFS, BeeGFS) supporting high-concurrency workloads.</li> <li><strong>Low-Level Storage Knowledge:</strong> Deep understanding of <strong>NAND, NVMe, PCIe, storage controllers</strong>, and performance characteristics across the stack.</li> <li><strong>High-Performance Data Paths:</strong> Proven experience with <strong>NFS over RDMA, RDMA-capable transports</strong>, or similar technologies. 
Familiarity with <strong>GPUDirect Storage</strong> strongly preferred.</li> <li><strong>Linux Systems Expertise:</strong> Strong Linux internals knowledge, including filesystems, I/O scheduling, memory management, and tuning for performance workloads.</li> <li><strong>Operational Excellence:</strong> Experience running 24/7 storage platforms with strong incident response, change management, and post-mortem discipline.</li> <li><strong>Communication &amp; Leadership:</strong> Ability to clearly communicate complex technical tradeoffs and lead teams through high-stakes infrastructure decisions.</li> <li><strong>Successful completion of a background check.</strong></li> </ul> <h3><strong>Preferred Qualifications</strong></h3> <ul> <li>Experience supporting AI training pipelines, large-scale model checkpointing, and dataset streaming workloads.</li> <li>Familiarity with RDMA fabrics and close collaboration with datacenter networking teams.</li> <li>Experience designing storage systems for multi-tenant isolation and secure data access.</li> <li>Background in hyperscale, HPC, or AI-focused infrastructure environments.</li> <li>Experience building internal storage platforms or abstractions consumed by product teams.</li> </ul> <p><strong>What You’ll Receive:</strong></p> <ul> <li>The competitive base pay for this position ranges from $150,000 to $240,000 USD. 
This salary range may span several career levels at Runpod and will be narrowed during the interview process based on a number of factors, including the candidate’s experience, qualifications, and location.</li> <li>Meaningful equity in a fast-growing company: everyone on the team receives stock options, so your impact drives our growth and you share in the upside.</li> <li>Generous medical, dental &amp; vision plans.</li> <li>Flexible PTO: take the time you need to recharge.</li> <li>Most roles are remote-first, with inclusive, collaborative teams that use Slack as the main form of internal communication.</li> <li>Join a passionate team on the cutting edge of AI infrastructure, where culture, learning, and ownership are at the heart of how we scale.</li> </ul>