Efficient Parameter Synchronization for Peer-to-Peer Distributed Learning With Selective Multicast

Published: 01 Jan 2025 · Last Modified: 12 May 2025 · IEEE Trans. Serv. Comput. 2025 · License: CC BY-SA 4.0
Abstract: Recent advances in distributed machine learning show, both theoretically and empirically, that for many models, provided every worker eventually participates in synchronization, (i) training still converges even if only $p$ workers take part in each synchronization round, and (ii) a larger $p$ generally yields a faster rate of convergence. These findings point the way toward eliminating the bottleneck effects of parameter synchronization in large-scale data-parallel distributed training and have motivated several optimization designs. In this paper, we focus on optimizing parameter synchronization for peer-to-peer distributed learning, where workers broadcast or multicast their updated parameters to others for synchronization, and propose SelMcast, a suite of expressive and efficient multicast receiver selection algorithms, to achieve this goal. Compared with the state-of-the-art (SOTA) design, which randomly selects exactly $p$ receivers for each worker's multicast in a bandwidth-agnostic way, SelMcast chooses receivers based on a global view of their available bandwidth and loads, yielding two advantages: accelerated parameter synchronization, which improves the utilization of computing resources, and larger average $p$ values, which speed up convergence. Comprehensive evaluations show that SelMcast is efficient for both peer-to-peer Bulk Synchronous Parallel (BSP) and Stale Synchronous Parallel (SSP) distributed training, significantly outperforming the SOTA solution.
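To make the contrast in the abstract concrete, below is a minimal, illustrative sketch of bandwidth- and load-aware receiver selection versus the bandwidth-agnostic random baseline. It is not the paper's actual SelMcast algorithm; the function names, the simple scoring heuristic, and the `available_bw` / `current_load` inputs are assumptions introduced only for illustration.

```python
import random
from typing import Dict, List


def random_receivers(workers: List[str], sender: str, p: int) -> List[str]:
    """Baseline described in the abstract: pick exactly p receivers
    uniformly at random, ignoring bandwidth and current load."""
    candidates = [w for w in workers if w != sender]
    return random.sample(candidates, min(p, len(candidates)))


def bandwidth_aware_receivers(
    workers: List[str],
    sender: str,
    p: int,
    available_bw: Dict[str, float],  # hypothetical residual bandwidth per worker
    current_load: Dict[str, int],    # hypothetical count of in-flight multicasts
) -> List[str]:
    """Illustrative stand-in for a bandwidth/load-aware selector (not SelMcast
    itself): rank candidates by a score that prefers high residual bandwidth
    and low load, then take the top p."""
    candidates = [w for w in workers if w != sender]

    def score(w: str) -> float:
        # Higher residual bandwidth and fewer in-flight transfers score better.
        return available_bw.get(w, 0.0) / (1 + current_load.get(w, 0))

    ranked = sorted(candidates, key=score, reverse=True)
    return ranked[: min(p, len(ranked))]


if __name__ == "__main__":
    workers = ["w0", "w1", "w2", "w3"]
    bw = {"w1": 10.0, "w2": 2.0, "w3": 8.0}
    load = {"w1": 0, "w2": 3, "w3": 1}
    print(random_receivers(workers, "w0", p=2))
    print(bandwidth_aware_receivers(workers, "w0", p=2, available_bw=bw, current_load=load))
```

The sketch only captures the high-level idea the abstract states, namely that receiver choice can use a global view of bandwidth and load rather than uniform sampling; how SelMcast actually scores receivers and enlarges the average $p$ is detailed in the paper itself.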