Track: Systems and infrastructure for Web, mobile, and WoT
Keywords: Communication hierarchy, message aggregation, communication domain, graph processing, Graph500
TL;DR: Communication Hierarchy-aware Graph Engine for Distributed Model Training
Abstract: Efficient processing of large-scale graphs with billions to trillions of edges is essential for training graph-based large language models (LLMs) in web-scale systems. The increasing complexity and size of these models create significant communication challenges due to the extensive message exchanges required across distributed nodes. Current graph engines struggle to scale effectively across hundreds of computing nodes because they often overlook variations in communication cost within the interconnection hierarchy. To address this challenge, we introduce TuComm, a communication hierarchy-aware engine designed to optimize distributed training of graph-based LLMs. By leveraging the hierarchical network topology, TuComm dynamically aggregates and transfers messages while fully accounting for the underlying communication domains, thereby improving the efficiency of distributed model training across large-scale systems. We implemented TuComm on top of the Message Passing Interface (MPI), incorporating innovations such as dynamic buffer expansion and active buffer switching to enhance scalability. Evaluations on synthetic and real-world datasets, using up to 79,024 nodes and over 1.2 million processor cores, demonstrate that TuComm surpasses leading graph-parallel systems and state-of-the-art counterparts in both throughput and scalability. Moreover, we have deployed TuComm on a production supercomputer, where it consistently outperforms the top solutions on the Graph500 list. These results highlight TuComm's potential to significantly improve the efficiency of distributed large-scale graph-based LLM training by optimizing communication among distributed systems, making it a valuable communication engine for web-scale model training.
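To make the communication-domain idea concrete, below is a minimal C/MPI sketch of the general hierarchy-aware aggregation pattern the abstract describes; it is not TuComm's actual implementation. Ranks on the same node first combine their outgoing messages at a per-node leader over cheap shared-memory communication, so only one aggregated buffer per node crosses the slower inter-node network. The communicator names, the fixed message size, and the equal-ranks-per-node assumption are all illustrative.

/*
 * Sketch of hierarchy-aware message aggregation over MPI
 * (illustrative only; not TuComm's code).
 */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

#define MSG_INTS 4  /* illustrative per-rank message size */

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int world_rank, world_size;
    MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);
    MPI_Comm_size(MPI_COMM_WORLD, &world_size);

    /* Intra-node communication domain: ranks that share memory. */
    MPI_Comm node_comm;
    MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED,
                        world_rank, MPI_INFO_NULL, &node_comm);

    int node_rank, node_size;
    MPI_Comm_rank(node_comm, &node_rank);
    MPI_Comm_size(node_comm, &node_size);

    /* Each rank produces a small message (here: its world rank repeated). */
    int msg[MSG_INTS];
    for (int i = 0; i < MSG_INTS; i++) msg[i] = world_rank;

    /* Step 1: aggregate intra-node messages at the node leader
       (rank 0 of node_comm) over shared memory. */
    int *agg = NULL;
    if (node_rank == 0)
        agg = malloc((size_t)node_size * MSG_INTS * sizeof(int));
    MPI_Gather(msg, MSG_INTS, MPI_INT,
               agg, MSG_INTS, MPI_INT, 0, node_comm);

    /* Step 2: only node leaders join the inter-node communicator and
       exchange one combined buffer per node across the network. */
    MPI_Comm leader_comm;
    int color = (node_rank == 0) ? 0 : MPI_UNDEFINED;
    MPI_Comm_split(MPI_COMM_WORLD, color, world_rank, &leader_comm);

    if (node_rank == 0) {
        int nleaders;
        MPI_Comm_size(leader_comm, &nleaders);
        /* Assumes every node hosts node_size ranks, for brevity. */
        int *all = malloc((size_t)nleaders * node_size * MSG_INTS * sizeof(int));
        MPI_Allgather(agg, node_size * MSG_INTS, MPI_INT,
                      all, node_size * MSG_INTS, MPI_INT, leader_comm);
        printf("leader %d exchanged %d aggregated ints\n",
               world_rank, nleaders * node_size * MSG_INTS);
        free(all);
        free(agg);
        MPI_Comm_free(&leader_comm);
    }

    MPI_Comm_free(&node_comm);
    MPI_Finalize();
    return 0;
}

The point of the pattern is that a few large inter-node transfers amortize network latency far better than many small per-rank messages, which is the cost asymmetry the abstract attributes to the interconnection hierarchy.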
Submission Number: 2444