Abstract: Distributed machine learning (DML) is a promising approach to training large models on large datasets. In DML, multiple workers collaborate to train a neural network, significantly reducing training time. The efficiency of DML is heavily influenced by communication, so current research must balance the trade-off between communication cost and model performance. Local methods excel at reducing communication cost but suffer degraded accuracy and generalizability. Global knowledge is valuable for improving the performance of local methods; however, theoretical analysis of its validity is lacking, and because of communication limitations and staleness it can currently be used only in the global aggregation step of local methods. To this end, we establish a mechanism of global knowledge guidance and propose Adaptive Global Knowledge Guided Distributed Stochastic Gradient Descent (AdaGK-SGD), which extends the guidance of global knowledge to the entire distributed training process without any additional communication. Specifically, based on this mechanism we define the maximum lifetime of global knowledge and establish a correlation between this lifetime and the validity of global knowledge, thereby circumventing the adverse effects of staleness. The resulting Maximum Lifetime of Global Knowledge module can also be applied separately to other algorithms. In addition, with practical deployment in mind, we provide a straightforward and efficient strategy for setting the maximum lifetime adaptively. We establish convergence rates for AdaGK-SGD in both convex and non-convex settings. Numerically, we find that AdaGK-SGD significantly improves the accuracy and generalizability of distributed algorithms compared with existing methods.
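To make the idea of lifetime-bounded global knowledge guidance concrete, the following is a minimal sketch, not the paper's algorithm: a simulated local-SGD loop in which each worker keeps the last globally aggregated model as its "global knowledge" and blends it into local updates only while the staleness stays below a maximum lifetime. The guidance weight `alpha`, the lifetime `tau_max`, and the toy quadratic objective are illustrative assumptions, not AdaGK-SGD's actual formulation or adaptive rule.

```python
import numpy as np

rng = np.random.default_rng(0)
n_workers, dim = 4, 10
local_steps, rounds = 20, 30          # local steps between global aggregations
lr, alpha, tau_max = 0.05, 0.1, 10    # guidance applied only while staleness < tau_max

# Each worker minimizes 0.5 * ||A_i x - b_i||^2 on its own data shard (toy objective).
A = [rng.normal(size=(50, dim)) for _ in range(n_workers)]
b = [rng.normal(size=50) for _ in range(n_workers)]

x_global = np.zeros(dim)
for r in range(rounds):
    x_local = [x_global.copy() for _ in range(n_workers)]
    for t in range(local_steps):
        for i in range(n_workers):
            grad = A[i].T @ (A[i] @ x_local[i] - b[i]) / len(b[i])
            x_local[i] -= lr * grad
            # Global-knowledge guidance: pull toward the last aggregated model,
            # but only while it is still "fresh" (staleness t below tau_max).
            if t < tau_max:
                x_local[i] -= lr * alpha * (x_local[i] - x_global)
    # Standard global aggregation (averaging), as in plain local SGD.
    x_global = np.mean(x_local, axis=0)

print("per-worker loss at the final global model:",
      [round(0.5 * np.mean((A[i] @ x_global - b[i]) ** 2), 3) for i in range(n_workers)])
```

With `tau_max = 0` the loop reduces to plain local SGD; the sketch only illustrates how a staleness threshold can gate when guidance from the last aggregation is trusted, which is the intuition behind the maximum lifetime of global knowledge.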