Keywords: Graph Neural Network, Distributed System, Graph Partition, Heterogeneous System
TL;DR: HsysGNN is a heterogeneity-aware GNN training framework that partitions workloads via a computation–communication topology graph, with two-level CPU–GPU caching and pipelining to cut communication, enabling faster training with no loss in accuracy.
Abstract: With the rapid evolution of GPUs, heterogeneous GPU environments have become increasingly common. However, most existing distributed graph neural network (GNN) training frameworks are designed for homogeneous settings, where discrepancies in GPU performance often exacerbate load imbalance. In this work, we propose a distributed training method tailored for heterogeneous GPU environments. We model GPUs and their interconnects as a computation–communication topology graph, which guides subgraph partitioning so that each GPU is assigned a workload proportional to its computational power and communication bandwidth, thereby balancing utilization across devices. Furthermore, we design a two-level CPU–GPU caching strategy and a pipeline-parallel execution scheme to further reduce inter-partition communication overhead. Experimental results show that, compared with existing approaches, our method significantly improves training performance while maintaining, and in some cases slightly improving, model accuracy.
Primary Area: other topics in machine learning (i.e., none of the above)
Submission Number: 23826