Keywords: Graph Neural Network, Distributed System, Graph Partition, Heterogeneous System
TL;DR: HsysGNN is a heterogeneity-aware GNN training framework that partitions workloads via a computation–communication topology graph, with two-level CPU–GPU caching and pipelining to cut communication, enabling faster training with no loss in accuracy.
Abstract: With the rapid evolution of GPUs, heterogeneous GPU environments have become increasingly common. However, most existing distributed graph neural network (GNN) training frameworks are designed for homogeneous settings, where discrepancies in GPU performance often exacerbate load imbalance. In this work, we propose a distributed training method tailored for heterogeneous GPU environments. We model GPUs and their interconnects as a computation–communication topology graph, which guides subgraph partitioning so that each GPU is assigned a workload proportional to its computational power and communication bandwidth, thereby balancing utilization across devices. Furthermore, we design a two-level CPU–GPU caching strategy and a pipeline-parallel execution scheme to further reduce inter-partition communication overhead. Experimental results show that, compared with existing approaches, our method significantly improves training performance while maintaining, and in some cases slightly improving, model accuracy.
Primary Area: other topics in machine learning (i.e., none of the above)
Submission Number: 23826