GriNNder: Large-Scale Full-Graph Training of Graph Neural Networks on a Single GPU with Storage

02 May 2025 (modified: 29 Oct 2025) · Submitted to NeurIPS 2025 · CC BY 4.0
Keywords: Graph Neural Networks, Deep Learning Systems, GNN Training Frameworks
TL;DR: This paper enables high-throughput, previously infeasible large-scale full-graph GNN training on a single GPU by utilizing storage devices.
Abstract: Full-graph training of graph neural networks (GNNs) processes the entire graph at once, preserving all input information and enabling straightforward validation of algorithmic gains. However, it typically requires multiple GPUs or servers, increasing cost and inter-server communication. Although single-server methods reduce expenses, they remain constrained by limited GPU/host memory as graph sizes grow, and naïvely applying storage-based methods from other domains to mitigate this limit is infeasible for large-scale graphs. Here, we introduce GriNNder, the first storage-based framework (e.g., using NVMe SSDs) for scalable and efficient full-graph GNN training. GriNNder alleviates GPU memory bottlenecks by offloading data to storage while keeping read/write traffic to and from the storage device minimal. Building on the observation that cross-partition dependencies follow a power-law distribution, we introduce an efficient partition-wise caching strategy that keeps the intermediate activations/gradients of full-graph dependencies in host memory. We also design a regathering mechanism for the gradient engine that minimizes storage traffic, and propose a lightweight partitioning scheme that overcomes the memory limitations of existing methods. GriNNder achieves up to a 9.78$\times$ speedup over the state-of-the-art baseline and throughput comparable to distributed baselines, while enabling previously infeasible large-scale full-graph training on a single GPU.
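To make the caching idea in the abstract concrete, below is a minimal sketch of a partition-wise host-memory cache with SSD spill. This is an illustration under assumptions, not GriNNder's actual implementation: the `PartitionCache` class, its eviction policy (pin the most-referenced boundary nodes in host memory, spill the rest to an NVMe-backed memory map), and all names are hypothetical, motivated only by the paper's observation that cross-partition dependencies are power-law distributed.

```python
# Hypothetical sketch of partition-wise caching with SSD spill.
# Not GriNNder's implementation; names and policy are assumptions.
import os
import tempfile

import numpy as np


class PartitionCache:
    """Keeps hot cross-partition activations in host memory and spills
    cold ones to an NVMe-backed memory-mapped file. Hotness is estimated
    from how often other partitions reference each boundary node, which
    under a power-law skew concentrates most traffic on a small hot set."""

    def __init__(self, num_nodes, feat_dim, hot_fraction=0.1, spill_dir=None):
        spill_dir = spill_dir or tempfile.gettempdir()
        path = os.path.join(spill_dir, "cold_activations.npy")
        # SSD-backed store for the cold majority of boundary nodes.
        self.cold = np.lib.format.open_memmap(
            path, mode="w+", dtype=np.float32, shape=(num_nodes, feat_dim)
        )
        self.hot = {}                                 # node id -> in-memory activation
        self.ref_counts = np.zeros(num_nodes, dtype=np.int64)
        self.hot_budget = int(hot_fraction * num_nodes)

    def record_refs(self, node_ids):
        """Count cross-partition references to boundary nodes."""
        np.add.at(self.ref_counts, node_ids, 1)

    def rebuild_hot_set(self):
        """Pin the most-referenced nodes in host memory for the next epoch."""
        hot_ids = np.argpartition(-self.ref_counts, self.hot_budget)[: self.hot_budget]
        self.hot = {int(i): self.cold[i].copy() for i in hot_ids}

    def put(self, node_id, activation):
        if node_id in self.hot:
            self.hot[node_id] = activation            # host memory, no SSD write
        else:
            self.cold[node_id] = activation           # SSD traffic only for cold nodes

    def get(self, node_id):
        return self.hot.get(node_id, self.cold[node_id])
```

Under the power-law skew the paper reports, caching even a small fraction of the most-referenced nodes in host memory would absorb the bulk of cross-partition reads and writes, which is consistent with the abstract's claim of minimal storage traffic.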
Supplementary Material: zip
Primary Area: Infrastructure (e.g., libraries, improved implementation and scalability, distributed solutions)
Submission Number: 5867