From Sancus to Sancus\(^q\): staleness and quantization-aware full-graph decentralized training in graph neural networks

Published: 01 Jan 2025 · Last Modified: 13 May 2025 · VLDB J. 2025 · CC BY-SA 4.0
Abstract: Graph neural networks (GNNs) have emerged as a powerful tool for modeling graph data, yet scaling them efficiently to large graphs remains challenging, which is where distributed GNN training comes into play. To avoid the communication caused by expensive data movement between workers, we propose Sancus and its advanced version Sancus\(^q\), staleness- and quantization-aware communication-avoiding decentralized GNN systems. By introducing a set of novel bounded embedding-staleness metrics and adaptively skipping broadcasts, Sancus abstracts decentralized GNN processing as sequential matrix multiplications and reuses cached historical embeddings. To further reduce communication volume, Sancus\(^q\) performs quantization-aware communication, shrinking the size of broadcast embedding messages. Theoretically, we show bounded approximation errors of embeddings and gradients, with a convergence guarantee matching the fastest known rate. Empirically, we evaluate Sancus and Sancus\(^q\) with common GNN models under different system setups on large-scale benchmark datasets. Compared to state-of-the-art systems, Sancus\(^q\) avoids up to \(86\%\) of communication with \(3.0\times \) faster throughput on average and no accuracy loss.
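To make the two mechanisms in the abstract concrete, here is a minimal sketch, not the authors' implementation: a per-worker broadcaster that skips a broadcast when its fresh embeddings drift little from the last broadcast copy (so peers keep reading their historical cache, keeping staleness bounded), and that uniformly quantizes embeddings before broadcasting. The class name, the relative-drift staleness metric, the `eps` threshold, and the 8-bit scheme are all illustrative assumptions.

```python
import torch


class StalenessAwareBroadcaster:
    """Sketch of staleness-bounded broadcast skipping plus quantized
    messages, under assumed metrics; not Sancus's actual protocol."""

    def __init__(self, eps: float = 0.1):
        self.eps = eps     # assumed staleness bound on embedding drift
        self.cache = None  # historical embedding last actually broadcast

    def should_skip(self, h: torch.Tensor) -> bool:
        """Skip broadcasting if the new embeddings are close enough to the
        cached ones, i.e., peers' stale copies remain within the bound."""
        if self.cache is None:
            return False
        drift = torch.norm(h - self.cache) / (torch.norm(self.cache) + 1e-12)
        return drift.item() <= self.eps

    @staticmethod
    def quantize(h: torch.Tensor):
        """Uniform 8-bit quantization to shrink the broadcast message."""
        lo, hi = h.min(), h.max()
        scale = (hi - lo) / 255.0 + 1e-12
        q = torch.round((h - lo) / scale).to(torch.uint8)
        return q, lo, scale

    @staticmethod
    def dequantize(q: torch.Tensor, lo: torch.Tensor, scale: torch.Tensor):
        """Receiver-side reconstruction of the approximate embeddings."""
        return q.to(torch.float32) * scale + lo

    def step(self, h: torch.Tensor):
        """Return a quantized message to broadcast, or None to signal
        that peers should keep using their cached historical copy."""
        if self.should_skip(h):
            return None
        self.cache = h.detach().clone()
        return self.quantize(self.cache)


# Usage: each epoch, a worker calls step() on its layer embeddings;
# a None result means no communication happens for that round.
worker = StalenessAwareBroadcaster(eps=0.1)
msg = worker.step(torch.randn(1024, 128))
```

Skipping a broadcast trades a bounded approximation error in peers' embeddings for zero communication that round, which is the source of both the communication savings and the error/convergence bounds the abstract claims.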