Expediting Distributed GNN Training with Feature-only Partition and Optimized Communication Planning
Abstract: Feature-only partition of large graph data in distributed Graph Neural Network (GNN) training offers advantages over the commonly adopted graph structure partition, such as minimal graph preprocessing cost and elimination of cross-worker subgraph sampling burdens. Nonetheless, the performance bottleneck of GNN training with feature-only partitions still largely lies in the substantial communication overhead of cross-worker feature fetching. To reduce this communication overhead and expedite distributed training, we first investigate and answer two key questions on the convergence behavior of the GNN model in feature-partition-based distributed GNN training: 1) Since no worker holds a complete copy of each feature, can gradient exchange among workers compensate for the information loss due to incomplete local features? 2) If the answer to the first question is negative, is feature fetching in every training iteration necessary to ensure model convergence? Based on our theoretical findings on these questions, we derive an optimal communication plan that decides the frequency of feature fetching during training, taking into account bandwidth levels among workers and striking a balance between model loss and training time. Extensive evaluation demonstrates results consistent with our theoretical analysis and the effectiveness of our proposed design.
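To make the idea of a bandwidth-aware feature-fetching frequency concrete, the following is a minimal sketch of how an interval k (fetch remote features once every k iterations, reuse cached copies otherwise) might be chosen to balance per-iteration communication time against staleness-induced loss degradation. The linear staleness penalty, the cost model, and all function and parameter names here are illustrative assumptions, not the paper's actual derivation or optimal plan.

```python
# Hypothetical sketch: pick a feature-fetching interval k that trades off
# amortized communication time against a loss penalty for reusing stale features.
# The cost model below is an assumption for illustration only.

def fetch_time_s(feature_bytes: float, bandwidth_bps: float) -> float:
    """Time to pull remote features for one mini-batch over the given bandwidth."""
    return feature_bytes * 8 / bandwidth_bps


def pick_fetch_interval(feature_bytes: float, bandwidth_bps: float,
                        compute_time_s: float, staleness_penalty: float,
                        max_interval: int = 64) -> int:
    """Choose k minimizing (amortized per-iteration time) * (1 + staleness penalty),
    where remote features are re-fetched only every k-th iteration."""
    best_k, best_cost = 1, float("inf")
    for k in range(1, max_interval + 1):
        # Compute happens every iteration; fetching is amortized over k iterations.
        avg_time = compute_time_s + fetch_time_s(feature_bytes, bandwidth_bps) / k
        # Assumed loss-degradation factor grows with average feature staleness.
        avg_staleness = (k - 1) / 2
        cost = avg_time * (1 + staleness_penalty * avg_staleness)
        if cost < best_cost:
            best_k, best_cost = k, cost
    return best_k


if __name__ == "__main__":
    # Example: 200 MB of remote features per batch, 10 Gbps inter-worker links,
    # 50 ms compute per iteration, mild staleness penalty (all assumed values).
    k = pick_fetch_interval(feature_bytes=200e6, bandwidth_bps=10e9,
                            compute_time_s=0.05, staleness_penalty=0.02)
    print(f"fetch remote features every {k} iterations")
```

Under this toy model, slower inter-worker bandwidth or larger feature volumes push the chosen interval k upward (fetch less often), while a higher staleness penalty pushes it back toward per-iteration fetching, which mirrors the loss-versus-training-time balance described in the abstract.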