Abstract: Graph neural networks (GNNs) have proven to be powerful tools for learning from graph-structured data and have achieved great success in many applications. As the sizes of real-world graphs continue to grow, traditional GNN training methods face significant scalability challenges. Recently, disks have gained attention as a cost-effective medium for storing large-scale graphs, and several disk-based GNN systems have been proposed to train GNNs on large-scale graphs using a single machine. However, these systems either overlook the unique data characteristics of GNN workloads when designing cache plans or fail to fully exploit the multilevel hierarchy of storage and computation during system execution, resulting in disk I/O bottlenecks and resource under-utilization. To address these issues, we present CaliEX, an advanced disk-based GNN system that jointly optimizes caching and execution within and across training stages. CaliEX first designs tailored cache plans and execution policies for both graph topology and features to accelerate neighborhood sampling and feature gathering. Since these two training stages operate on different types of data, CaliEX further auto-tunes the cache allocation and pipelines execution across stages to improve resource utilization and overall training throughput. Evaluations on multiple GNN models and various large-scale datasets show that CaliEX achieves a 3.28× average speedup over existing disk-based GNN training systems.
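To make the high-level idea concrete, the sketch below illustrates (in plain Python) the general pattern the abstract describes: maintaining separate in-memory caches for graph topology and node features, and pipelining neighborhood sampling with feature gathering so that the disk I/O of one stage overlaps with the work of the other. This is only a minimal illustration under assumed names, cache sizes, and data layouts; it is not CaliEX's actual implementation or API.

```python
# Minimal sketch (illustrative assumptions throughout, not CaliEX's code):
# separate caches for topology and features, plus a producer-consumer
# pipeline that overlaps sampling (stage 1) with feature gathering (stage 2).
import queue
import random
import threading

NUM_NODES = 1_000
FEATURE_DIM = 8

# Synthetic "on-disk" stores, stood in for by dicts; a real system would
# read adjacency lists and feature vectors from disk files.
disk_topology = {v: [random.randrange(NUM_NODES) for _ in range(5)] for v in range(NUM_NODES)}
disk_features = {v: [random.random() for _ in range(FEATURE_DIM)] for v in range(NUM_NODES)}

# Separate caches for the two data types: sampling touches topology,
# gathering touches features. Capacities are arbitrary placeholders.
topology_cache: dict = {}
feature_cache: dict = {}
TOPO_CACHE_CAP, FEAT_CACHE_CAP = 200, 200


def read_neighbors(v):
    """Fetch neighbors, caching them to avoid repeated 'disk' reads."""
    if v not in topology_cache:
        if len(topology_cache) >= TOPO_CACHE_CAP:
            topology_cache.pop(next(iter(topology_cache)))  # naive FIFO eviction
        topology_cache[v] = disk_topology[v]
    return topology_cache[v]


def read_feature(v):
    """Fetch a feature vector with the same naive caching policy."""
    if v not in feature_cache:
        if len(feature_cache) >= FEAT_CACHE_CAP:
            feature_cache.pop(next(iter(feature_cache)))
        feature_cache[v] = disk_features[v]
    return feature_cache[v]


def sampler(batches, out):
    """Stage 1: neighborhood sampling, run in a background thread."""
    for seeds in batches:
        sampled = {v for s in seeds for v in read_neighbors(s)} | set(seeds)
        out.put(sampled)
    out.put(None)  # sentinel: no more batches


def train(batches):
    """Stage 2: feature gathering (+ model compute), overlapped with sampling."""
    q = queue.Queue(maxsize=4)
    threading.Thread(target=sampler, args=(batches, q), daemon=True).start()
    while True:
        sampled = q.get()
        if sampled is None:
            break
        feats = [read_feature(v) for v in sampled]  # gather features
        _ = sum(sum(f) for f in feats)              # placeholder for GNN compute


if __name__ == "__main__":
    train([[random.randrange(NUM_NODES) for _ in range(32)] for _ in range(10)])
    print("topology cache entries:", len(topology_cache),
          "feature cache entries:", len(feature_cache))
```

In this toy setup the queue between the two threads is what lets sampling for the next mini-batch proceed while features for the current one are being gathered; the paper's contribution additionally tunes how much memory each cache receives, which the fixed capacities above do not capture.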
External IDs: dblp:conf/icde/SuZZSALBC25