DyGNeX: Efficient Distributed Training of Dynamic Graph Neural Networks with Cross-Time-Window Scheduling

27 Sept 2024 (modified: 18 Jan 2025) · ICLR 2025 Conference Withdrawn Submission · CC BY 4.0
Keywords: Dynamic Graph Neural Networks, distributed training, load balancing
TL;DR: This paper presents a distributed training system for DGNNs that balances computational load across GPUs.
Abstract: Dynamic Graph Neural Networks (DGNNs) are advanced methods for processing evolving graph data, efficiently capturing both structural and temporal dependencies. However, existing distributed DGNN training methods struggle to achieve load balance across GPUs and to minimize communication overhead, which limits their efficiency. In this paper, we introduce DyGNeX, a distributed training system designed to address these issues. DyGNeX utilizes a cross-time-window snapshot group scheduling algorithm that balances computational loads across GPUs without introducing additional cross-GPU feature aggregation or hidden state communication. Depending on the deployment scenario, the scheduler uses either a greedy heuristic or Integer Linear Programming (ILP), yielding the variants DyGNeX-G and DyGNeX-L, respectively. DyGNeX-L and DyGNeX-G achieve average reductions of 28\% and 24\% in per-epoch training time compared to state-of-the-art methods, keep load imbalance across GPUs at approximately 4\% and 8\%, respectively, and preserve model convergence across various DGNN models and datasets. In simulation experiments, DyGNeX-G scales well as the number of GPUs increases, efficiently handling clusters of up to 512 GPUs while maintaining 95\% efficiency.
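The abstract describes a scheduler that assigns snapshot groups to GPUs so that per-GPU compute is balanced. As an illustration only, the sketch below shows what a greedy assignment of this kind could look like, using the classic longest-processing-time heuristic; the function name, cost model, and interface are hypothetical assumptions, not the authors' DyGNeX-G implementation.

```python
# Hypothetical sketch of greedy snapshot-group scheduling (not the authors'
# code): each snapshot group has an estimated compute cost, and the
# heaviest remaining group is always assigned to the least-loaded GPU.
# Groups stay whole, so no cross-GPU feature aggregation is introduced.
import heapq

def greedy_schedule(group_costs: list[float], num_gpus: int) -> list[list[int]]:
    """Assign snapshot groups to GPUs, heaviest group first, least-loaded GPU first."""
    # Min-heap of (current load, gpu id) so heappop yields the least-loaded GPU.
    heap = [(0.0, g) for g in range(num_gpus)]
    heapq.heapify(heap)
    assignment = [[] for _ in range(num_gpus)]
    # Visit group indices in descending cost order (LPT heuristic).
    for idx in sorted(range(len(group_costs)), key=lambda i: -group_costs[i]):
        load, gpu = heapq.heappop(heap)
        assignment[gpu].append(idx)
        heapq.heappush(heap, (load + group_costs[idx], gpu))
    return assignment

# Example: 6 snapshot groups with uneven costs over 3 GPUs.
# Resulting loads are 12, 11, 11 -- roughly balanced, as intended.
print(greedy_schedule([9.0, 7.0, 6.0, 5.0, 4.0, 3.0], 3))
```

The ILP variant (DyGNeX-L) would instead solve the same assignment problem exactly, trading scheduling time for the tighter (~4\%) imbalance reported in the abstract.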
Primary Area: infrastructure, software libraries, hardware, systems, etc.
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Reciprocal Reviewing: I understand the reciprocal reviewing requirement as described on https://iclr.cc/Conferences/2025/CallForPapers. If none of the authors are registered as a reviewer, it may result in a desk rejection at the discretion of the program chairs. To request an exception, please complete this form at https://forms.gle/Huojr6VjkFxiQsUp6.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 10432