Straggler-Resilient Decentralized Learning via Adaptive Asynchronous Updates

TMLR Paper 1264 Authors

12 Jun 2023 (modified: 17 Sept 2024) · Rejected by TMLR · CC BY 4.0
Abstract: With the increasing demand for large-scale training of machine learning models, fully decentralized optimization methods have recently been advocated as alternatives to the popular parameter-server framework. In this paradigm, each worker maintains a local estimate of the optimal parameter vector and iteratively updates it by waiting for and averaging the estimates obtained from its neighbors, then correcting it based on its local dataset. However, the synchronization phase is sensitive to stragglers. An efficient way to mitigate this effect is to use asynchronous updates, where each worker computes stochastic gradients and communicates with other workers at its own pace. Unfortunately, fully asynchronous updates suffer from staleness of the stragglers' parameters. To address these limitations, we propose DSGD-AAU, a fully decentralized algorithm with adaptive asynchronous updates that adaptively determines the number of neighbors each worker communicates with. We show that DSGD-AAU achieves a linear speedup for convergence (i.e., convergence performance improves linearly with the number of workers). Experimental results on a suite of datasets and deep neural network models verify our theoretical results.
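To illustrate the mechanism the abstract describes, below is a minimal sketch of one worker's update loop under adaptive asynchronous updates. This is not the authors' exact DSGD-AAU algorithm; the helper names (`poll_neighbor`, `local_gradient`) and the adaptive rule for the neighbor count `k` are hypothetical placeholders.

```python
import numpy as np

def worker_loop(x, neighbors, local_gradient, poll_neighbor, lr=0.01, T=1000):
    """Sketch of one worker's loop with adaptive asynchronous updates.

    x              -- this worker's local parameter estimate (np.ndarray)
    neighbors      -- list of neighbor handles, assumed ordered by response speed
    local_gradient -- callable returning a stochastic gradient at x (local data)
    poll_neighbor  -- callable returning a neighbor's current (possibly stale) estimate
    """
    k = 1  # number of neighbors to wait for; adapted over time
    for t in range(T):
        # Wait for only the k fastest neighbors rather than all of them,
        # so stragglers do not block the averaging (synchronization) step.
        received = [poll_neighbor(n) for n in neighbors[:k]]
        # Consensus step: average the local estimate with the received ones.
        x = np.mean([x] + received, axis=0)
        # Local correction step using a stochastic gradient.
        x = x - lr * local_gradient(x)
        # Placeholder adaptive rule: periodically grow k so that stale
        # estimates from slower workers are eventually incorporated.
        if t % 100 == 0:
            k = min(k + 1, len(neighbors))
    return x
```

The key design point sketched here is the trade-off the paper targets: waiting for all neighbors (k = len(neighbors)) recovers synchronous averaging but stalls on stragglers, while waiting for too few keeps stale parameters in circulation; adapting k mediates between the two.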
Submission Length: Regular submission (no more than 12 pages of main content)
Changes Since Last Submission: We thank all reviewers for their constructive feedback and comments. We provided a point-by-point response to all comments and suggestions from the reviewers, detailing the changes and updates made in the revised paper. Here we briefly summarize them and kindly refer you to the details in our response to each reviewer.
- We clarify the definitions (e.g., iteration, asynchronous updates) in our proposed algorithm, as well as the key design insights and algorithm explanation.
- We further improve the theoretical results (the condition on $K$).
- We provide further details about our experiments and settings.
Assigned Action Editor: ~Virginia_Smith1
Submission Number: 1264