Abstract: In distributed deep learning, computational clusters place increasing emphasis on the static latency of all-reduce communication while also needing to support large-scale networking. However, the communication efficiency of current all-reduce algorithms on specialized network topologies remains limited: existing algorithms underutilize cluster bandwidth, leaving a considerable portion of it idle. Optimizing the communication algorithm is therefore essential to fully exploit the available bandwidth. To address this, we propose a topology-aware interleaved all-reduce algorithm for Dragonfly networks (TIAD). Leveraging the structural characteristics of the Dragonfly network, TIAD employs an interleaved communication mechanism for both intra-group and inter-group data collection, substantially improving communication efficiency. Moreover, we refine the Dragonfly network with minimal adjustments so that it matches the theoretical structure required by interleaved communication. We also propose a complementary all-reduce communication method for scenarios in which only a subset of nodes in the Dragonfly network participates in the communication task. Our experiments demonstrate that TIAD achieves the shortest communication time across diverse node counts and bandwidth conditions. Notably, our algorithm reduces communication time by up to 23.4% during the collective communication phase compared with the PAARD algorithm.
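The abstract does not give the algorithmic details of TIAD, but the general setting it builds on is a two-level, group-structured all-reduce: values are first combined within each Dragonfly group, the partial results are then exchanged between groups, and the final result is distributed back to all nodes. The sketch below is a minimal, assumed illustration of that generic hierarchical pattern only; it is not the authors' TIAD algorithm, and the group layout, leader choice, and sequential loops are illustrative assumptions (a real implementation would interleave and pipeline the intra- and inter-group phases over actual network links).

```python
# Minimal sketch (not the authors' TIAD): a generic two-level hierarchical
# all-reduce over a Dragonfly-like grouping -- reduce within each group,
# combine the partial sums across groups, then broadcast the result back.
from typing import List


def hierarchical_allreduce(node_values: List[List[float]],
                           group_size: int) -> List[List[float]]:
    """Return every node's copy of the element-wise global sum.

    node_values[i] is node i's local vector; nodes are partitioned into
    consecutive groups of `group_size` (an assumed Dragonfly-style grouping).
    """
    n = len(node_values)
    assert n % group_size == 0, "nodes must fill whole groups in this sketch"
    dim = len(node_values[0])

    # Phase 1: intra-group reduction of each group's local vectors.
    group_sums = []
    for g in range(0, n, group_size):
        s = [0.0] * dim
        for i in range(g, g + group_size):
            s = [a + b for a, b in zip(s, node_values[i])]
        group_sums.append(s)

    # Phase 2: inter-group combination of the per-group partial sums.
    global_sum = [0.0] * dim
    for s in group_sums:
        global_sum = [a + b for a, b in zip(global_sum, s)]

    # Phase 3: intra-group broadcast of the result back to every node.
    return [list(global_sum) for _ in range(n)]


if __name__ == "__main__":
    # 8 nodes, 2 groups of 4, each holding a 3-element gradient vector.
    vals = [[float(i), float(i) * 2, 1.0] for i in range(8)]
    out = hierarchical_allreduce(vals, group_size=4)
    print(out[0])  # every node ends with the same global sum
```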