Abstract: Communication is a key bottleneck for distributed graph neural network (GNN) training. Existing GNN training systems fail to scale to deep GNNs because of the tremendous amount of inter-GPU communication. This paper proposes Mithril, a new approach that significantly scales distributed full-graph training of deep GNNs. As the first system to use layer-level model parallelism for GNN training, Mithril partitions GNN layers among GPUs so that each device performs the computation for a disjoint subset of consecutive GNN layers on the whole graph. Compared to graph parallelism, where each GPU handles a graph partition, Mithril reduces the communication volume by a factor of the number of GNN layers, enabling it to scale to deep models. Mithril overcomes the unique challenges of pipelined layer-level model parallelism on the whole graph by partitioning the graph into dependent chunks, breaking the dependencies with embedding speculation, and applying dedicated training techniques to ensure convergence. We also propose a hybrid approach that combines Mithril with graph parallelism to handle large graphs, achieve better compute resource utilization, and ensure model convergence. We build a general GNN training system supporting all three parallelism settings. Extensive experiments show that Mithril reduces the per-epoch communication volume by up to $22.89 \times$ (on average $6.78 \times$). It achieves a maximum training time speedup of $2.34 \times$ (on average $1.49 \times$) on a GPU cluster with a high-performance InfiniBand network. On another cluster with a commodity Ethernet network, Mithril outperforms the baseline by up to $10.21 \times$ (on average $7.16 \times$). Mithril also achieves model accuracy and convergence speed comparable to graph parallelism.
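To make the contrast with graph parallelism concrete, the following is a minimal, illustrative sketch (not Mithril's actual code) of the layer-level partitioning idea described above: consecutive GNN layers are split into contiguous chunks, one chunk per GPU, so that activations cross devices only at chunk boundaries. The function name `partition_layers` and its signature are assumptions made for illustration.

```python
# Illustrative sketch of layer-level model parallelism: assign consecutive GNN
# layers to devices, in contrast to graph parallelism, where every device holds
# all layers for one partition of the graph.

def partition_layers(num_layers: int, num_devices: int) -> list[range]:
    """Split layer indices 0..num_layers-1 into contiguous, near-equal chunks,
    one chunk per device. Names and signature are illustrative assumptions."""
    base, extra = divmod(num_layers, num_devices)
    chunks, start = [], 0
    for d in range(num_devices):
        size = base + (1 if d < extra else 0)
        chunks.append(range(start, start + size))
        start += size
    return chunks

if __name__ == "__main__":
    # Example: an 8-layer GNN on 4 GPUs. Each GPU runs 2 consecutive layers on
    # the whole graph, so inter-GPU traffic occurs only between layer chunks.
    for device_id, layers in enumerate(partition_layers(8, 4)):
        print(f"GPU {device_id}: layers {list(layers)}")
```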