FedGSync: Jointly Optimized Weak Synchronization and Gradient Transmission for Fast Distributed Machine Learning in Heterogeneous WAN
Abstract: For privacy and cost reasons, distributed machine learning in Wide-Area Networks (DML-WAN) is emerging as a popular collaborative learning paradigm. However, heterogeneity in computing power and data distribution among workers in different locations has a dramatic impact on training performance, including convergence speed and learning accuracy. Most existing distributed training mechanisms focus on either computing heterogeneity or data heterogeneity, and none of them handles both well. In this paper, we propose FedGSync, a novel distributed training mechanism that improves training performance for DML-WAN, where computing heterogeneity and data heterogeneity usually coexist. To speed up training and improve model accuracy, FedGSync clusters workers into groups according to the similarity of their data distributions and introduces group-based weak synchronization, which minimizes both the synchronization delay spent waiting for slow workers and the accuracy loss, the latter by balancing the contributions of all data distributions. To preserve data privacy and improve efficiency, FedGSync groups workers based only on the principal components of their gradients and uses an approximate grouping mechanism based on K-means. To further reduce synchronization time, FedGSync prioritizes packets and uses differential transmission for gradient packets between groups. Evaluation results demonstrate that, when computing heterogeneity and data heterogeneity coexist, FedGSync improves convergence speed and learning accuracy compared with state-of-the-art distributed training mechanisms.
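The grouping step described in the abstract can be illustrated with a minimal sketch. It is not FedGSync's actual algorithm, whose details the abstract does not give. It assumes each worker reports one flattened gradient vector, reduces those vectors to a few principal components using power iteration, and clusters the projections with plain K-means. The function names (`principal_directions`, `kmeans`, `group_workers`) and all parameters are hypothetical, chosen only for illustration.

```python
import random


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def principal_directions(rows, k, iters=200, seed=0):
    """Approximate the top-k principal directions of the mean-centered
    rows using power iteration with deflation (illustrative only)."""
    rng = random.Random(seed)
    n, d = len(rows), len(rows[0])
    mean = [sum(r[j] for r in rows) / n for j in range(d)]
    X = [[r[j] - mean[j] for j in range(d)] for r in rows]
    comps = []
    for _ in range(k):
        v = [rng.gauss(0.0, 1.0) for _ in range(d)]
        norm = dot(v, v) ** 0.5
        v = [c / norm for c in v]
        for _ in range(iters):
            # One power-iteration step: w = X^T (X v), then normalize.
            xv = [dot(x, v) for x in X]
            w = [sum(X[i][j] * xv[i] for i in range(n)) for j in range(d)]
            norm = dot(w, w) ** 0.5
            if norm == 0.0:
                break
            v = [c / norm for c in w]
        comps.append(v)
        # Deflate: remove the part of every row that lies along v.
        X = [[xj - dot(x, v) * vj for xj, vj in zip(x, v)] for x in X]
    return mean, comps


def dist2(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, iters=50):
    """Lloyd's K-means with deterministic farthest-point initialization."""
    centers = [list(points[0])]
    while len(centers) < k:
        far = max(points, key=lambda p: min(dist2(p, c) for c in centers))
        centers.append(list(far))
    labels = [0] * len(points)
    for _ in range(iters):
        labels = [min(range(k), key=lambda c: dist2(p, centers[c]))
                  for p in points]
        for c in range(k):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


def group_workers(gradients, n_groups, n_components=2):
    """Assign each worker a group label from its gradient's projection
    onto the leading principal directions (hypothetical helper)."""
    mean, comps = principal_directions(gradients, n_components)
    points = [[dot([g - m for g, m in zip(row, mean)], c) for c in comps]
              for row in gradients]
    return kmeans(points, n_groups)


# Toy input (made up): six workers whose gradients fall into two families.
toy_gradients = [
    [1.0, 0.1, 0.0, 0.0], [0.9, 0.0, 0.1, 0.0], [1.1, 0.05, 0.0, 0.0],
    [-1.0, 0.0, 0.1, 0.0], [-0.9, 0.1, 0.0, 0.0], [-1.1, 0.0, 0.0, 0.05],
]
toy_labels = group_workers(toy_gradients, n_groups=2)
```

In this sketch only low-dimensional projections of gradients enter the clustering, which mirrors the abstract's point that grouping does not need raw training data. A real deployment would have to decide how those projections are computed and exchanged across workers, which the sketch does not address.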