Delay-Tolerant Local SGD for Efficient Distributed TrainingDownload PDF

28 Sept 2020 (modified: 05 May 2023)ICLR 2021 Conference Blind SubmissionReaders: Everyone
Keywords: Delay-tolerant, communication-efficient, distributed learning
Abstract: The heavy communication for model synchronization is a major bottleneck for scaling up the distributed deep neural network training to many workers. Moreover, model synchronization can suffer from long delays in scenarios such as federated learning and geo-distributed training. Thus, it is crucial that the distributed training methods are both \textit{delay-tolerant} AND \textit{communication-efficient}. However, existing works cannot simultaneously address the communication delay and bandwidth constraint. To address this important and challenging problem, we propose a novel training framework OLCO\textsubscript{3} to achieve delay tolerance with a low communication budget by using stale information. OLCO\textsubscript{3} introduces novel staleness compensation and compression compensation to combat the influence of staleness and compression error. Theoretical analysis shows that OLCO\textsubscript{3} achieves the same sub-linear convergence rate as the vanilla synchronous stochastic gradient descent (SGD) method. Extensive experiments on deep learning tasks verify the effectiveness of OLCO\textsubscript{3} and its advantages over existing works.
One-sentence Summary: We propose a delay-tolerant AND communication-efficient training method for distributed learning.
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics
Reviewed Version (pdf):
5 Replies