SPA: SCALING GRAPH NEURAL NETWORK TRAINING ON LARGE GRAPHS VIA PROBABILISTIC SPLITTING

Published: 11 Feb 2025 · Last Modified: 13 May 2025 · MLSys 2025 (with shepherding) · CC BY 4.0
Keywords: Graph Neural Networks, Distributed GNN Training
TL;DR: Scaling graph neural network training by eliminating redundant work, using a novel partitioning scheme that probabilistically minimizes communication overheads.
Abstract: Graph neural networks (GNNs), an emerging class of machine learning models for graphs, have gained popularity for their superior performance in various graph analytical tasks. Mini-batch training is commonly used to train GNNs on large graphs, and data parallelism is the standard approach to scale mini-batch training across multiple GPUs. Data-parallel approaches perform redundant work because the subgraphs sampled by different GPUs overlap significantly. To address this issue, we introduce a hybrid parallel mini-batch training paradigm called split parallelism. Split parallelism avoids redundant work by splitting the sampling, loading, and training of each mini-batch across multiple GPUs. Split parallelism, however, introduces communication overheads that can outweigh the savings from removing redundant work. We further present a lightweight partitioning algorithm that probabilistically minimizes these overheads. We implement split parallelism in Spa and show that it outperforms state-of-the-art mini-batch training systems such as DGL, Quiver, and P3.
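The abstract does not spell out the partitioning algorithm, so the following is only a minimal illustrative sketch of the general idea of probabilistically minimizing cross-GPU communication: frequently sampled vertices are placed first, each on the GPU that already holds the largest (sampling-probability-weighted) share of its neighbors, with a small load-balance penalty. All names, the scoring rule, and the greedy scheme are assumptions for illustration, not Spa's actual method.

```python
# Hypothetical sketch (NOT Spa's algorithm): greedy partitioning that tries to
# co-locate frequently sampled neighborhoods on the same GPU, so that a
# mini-batch split across GPUs crosses as few partition boundaries as possible.
import numpy as np

def probabilistic_partition(num_nodes, edges, num_gpus,
                            sample_prob=None, balance_weight=0.01, seed=0):
    """Return a GPU id for every vertex (illustrative only)."""
    rng = np.random.default_rng(seed)
    prob = np.ones(num_nodes) if sample_prob is None else np.asarray(sample_prob, float)

    # Build an undirected adjacency list once.
    adj = [[] for _ in range(num_nodes)]
    for u, v in edges:
        adj[u].append(v)
        adj[v].append(u)

    part = np.full(num_nodes, -1, dtype=np.int64)  # -1 = unassigned
    load = np.zeros(num_gpus)                      # vertices per GPU

    # Place the most frequently sampled vertices first; their placement
    # matters most for the expected communication volume.
    for v in np.argsort(-prob):
        # Expected cross-GPU traffic avoided by co-locating v with each GPU:
        # total sampling probability of v's neighbors already on that GPU.
        gain = np.zeros(num_gpus)
        for u in adj[v]:
            if part[u] >= 0:
                gain[part[u]] += prob[u]
        score = gain - balance_weight * load       # trade gain against load balance
        best = rng.choice(np.flatnonzero(score == score.max()))
        part[v] = best
        load[best] += 1.0
    return part

# Toy usage: a 6-vertex ring split across 2 GPUs.
edges = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5), (5, 0)]
print(probabilistic_partition(6, edges, num_gpus=2))
```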
Submission Number: 273