Abstract: Graph Neural Networks (GNNs) have emerged as the state-of-the-art method for graph-based learning tasks. However, training GNNs at scale remains challenging, limiting the exploration of more sophisticated GNN architectures and their application to large real-world graphs. In distributed GNN training, communication overhead and waiting times have become major performance bottlenecks. To address these challenges, we propose PipeQS, an adaptive quantization and staleness-aware pipelined distributed training system for GNNs. PipeQS dynamically adjusts the bit-width of message quantization and manages staleness to reduce both communication overhead and waiting time. By detecting pipeline bottlenecks caused by synchronization and using cached communication to bypass message delays, PipeQS significantly improves training efficiency. Experimental results validate the effectiveness of PipeQS, showing up to an 8.3\( \times \) improvement in throughput while maintaining full-graph accuracy. Furthermore, our theoretical analysis demonstrates fast convergence at a rate of \(O(T^{-\frac{1}{2}})\), where \(T\) is the total number of training epochs. PipeQS achieves a well-balanced trade-off between training speed and accuracy, significantly reducing training time without compromising performance. The code is available at https://github.com/suupahako/PipeQS-code.
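The abstract's core mechanism is adjusting the bit-width used to quantize boundary-node messages in response to observed communication delay. Below is a minimal illustrative sketch of that idea, not the authors' implementation: the function names (`quantize`, `dequantize`, `choose_bits`), the thresholds in the bit-width policy, and the use of uniform min-max quantization are all assumptions made for the example.

```python
# A minimal sketch (not the PipeQS implementation) of adaptive-bit-width
# message quantization for boundary-node features in distributed GNN training.
# The bit-width policy, thresholds, and function names are illustrative assumptions.
import torch

def quantize(x: torch.Tensor, bits: int):
    """Uniform min-max quantization of a feature tensor to `bits` bits."""
    levels = 2 ** bits - 1
    x_min, x_max = x.min(), x.max()
    scale = (x_max - x_min).clamp(min=1e-12) / levels
    codes = torch.round((x - x_min) / scale).to(torch.int32)  # integer codes to transmit
    return codes, x_min.item(), scale.item()

def dequantize(codes: torch.Tensor, zero_point: float, scale: float) -> torch.Tensor:
    """Reconstruct an approximate float tensor from the received integer codes."""
    return codes.to(torch.float32) * scale + zero_point

def choose_bits(wait_time_ms: float) -> int:
    """Hypothetical policy: shrink bit-width when communication waiting time grows,
    and restore precision when communication is not the bottleneck."""
    if wait_time_ms > 50.0:
        return 2
    if wait_time_ms > 10.0:
        return 4
    return 8

# Example: quantize boundary-node embeddings before sending them to a peer partition.
h_boundary = torch.randn(1024, 128)           # features of nodes on the partition boundary
bits = choose_bits(wait_time_ms=23.0)         # delay measured on the previous iteration (assumed)
codes, zero_point, scale = quantize(h_boundary, bits)
h_approx = dequantize(codes, zero_point, scale)  # what the receiver would reconstruct
print(bits, (h_boundary - h_approx).abs().max().item())
```

In this sketch the sender transmits only the integer codes plus two scalars, so lowering the bit-width directly shrinks message volume; a staleness-aware pipeline would additionally let a worker proceed with a cached (possibly stale) reconstruction instead of waiting for the latest message.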