Scalable and Load-Balanced Full-Graph GNN Training on Multiple GPUs

Qiange Wang, Yao Chen, Weng-Fai Wong, Bingsheng He

Published: 2025, Last Modified: 21 Jan 2026. IEEE Trans. Knowl. Data Eng. 2025. CC BY-SA 4.0

Abstract: While full-graph training is effective for graph learning, it typically demands substantial memory resources. Existing multi-GPU training frameworks struggle with scalability because they require retaining data for each layer within GPU memory. In this work, we present $\mathsf{HongTu}$, a memory-efficient system that supports out-of-memory full-graph GNN training on GPUs. $\mathsf{HongTu}$ offloads vertex data to CPU memory and employs partition-parallel training that splits large graphs and assigns the partitions to multiple GPUs. To reduce runtime memory consumption with optimal performance, $\mathsf{HongTu}$ utilizes a hybrid solution combining recomputation, caching, and computation reordering, enabling efficient layer-wise intermediate data management. To address the increased communication caused by duplicated neighbor access among partitions, $\mathsf{HongTu}$ employs a deduplicated communication framework that converts host-GPU transfers into more efficient inter/intra-GPU data access. Additionally, $\mathsf{HongTu}$ tackles the load-imbalance issues in out-of-memory full-graph training, featuring a multi-objective graph partition algorithm that balances memory consumption and data transfer and maximizes the effectiveness of communication deduplication. Experiments on a 4×A100 GPU server show that $\mathsf{HongTu}$ can effectively train billion-edge graphs while reducing host-GPU data communication by 25% to 71%. Compared to the full-graph GNN system running on 16 CPU nodes, $\mathsf{HongTu}$ achieves speedups ranging from 11.4× to 21.3×.
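
The duplicated neighbor access mentioned in the abstract can be illustrated with a toy counting model. The sketch below is not the paper's implementation; it only assumes that each partition must read the features of every source vertex feeding its own destination vertices, and compares the number of host-GPU vertex loads when every partition fetches independently against the case where each distinct vertex is fetched from the host once and shared between GPUs. The function names, the edge list, and the `owner` mapping are all hypothetical.

```python
from collections import defaultdict


def required_vertices(edges, owner):
    """Map each partition to the set of source vertices its destinations read.

    edges: iterable of (src, dst) pairs.
    owner: dict mapping a vertex id to the partition (GPU) that owns it.
    """
    needed = defaultdict(set)
    for src, dst in edges:
        needed[owner[dst]].add(src)
    return needed


def host_transfer_counts(edges, owner):
    """Return (naive, dedup) counts of vertex loads from host memory.

    naive: every partition loads all vertices it needs from the host.
    dedup: each distinct vertex is loaded from the host once; other
           partitions are assumed to obtain it over GPU-to-GPU links.
    """
    needed = required_vertices(edges, owner)
    naive = sum(len(vertices) for vertices in needed.values())
    dedup = len(set().union(*needed.values())) if needed else 0
    return naive, dedup


# Hypothetical 4-vertex graph split across 2 partitions.
edges = [(0, 2), (1, 2), (0, 3), (1, 3), (2, 3), (3, 0)]
owner = {0: 0, 1: 0, 2: 0, 3: 1}

naive, dedup = host_transfer_counts(edges, owner)
print(naive, dedup)  # prints "6 4"
```

In this toy graph both partitions read vertices 0 and 1, so sharing them removes 2 of the 6 host loads. The figures reported in the abstract come from the authors' experiments on real graphs, not from a model of this kind.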

External IDs: dblp:journals/tkde/WangCWH25