Local SGD Meets Asynchrony

28 Sept 2020 (modified: 05 May 2023) | ICLR 2021 Conference Blind Submission | Readers: Everyone
Keywords: SGD, Data-parallel, Asynchronous, Optimization, Non-convex, Deep Neural Network
Abstract: Distributed variants of stochastic gradient descent (SGD) are central to training deep neural networks on massive datasets. Several scalable versions of data-parallel SGD have been developed, leveraging asynchrony, communication compression, and local gradient steps. Current research seeks a balance between distributed scalability, which calls for minimizing the amount of synchronization, and generalization performance, which calls for matching or exceeding the accuracy of the sequential baseline. However, a key issue in this regime is largely unaddressed: if "local" data-parallelism is aggressively applied to better utilize the computing resources available at each worker, the generalization performance of the trained model degrades. In this paper, we present a method to improve the "local scalability" of decentralized SGD. In particular, we propose two key techniques: (a) shared-memory-based asynchronous gradient updates at decentralized workers that keep the local minibatch size small, and (b) asynchronous, non-blocking, in-place averaging that overlaps with the local updates, thus utilizing all compute resources at all times without requiring large minibatches. Empirically, the additional noise introduced by this procedure proves to be a boon for generalization. On the theoretical side, we show that this method guarantees ergodic convergence for non-convex objectives and achieves the classic sublinear rate under standard assumptions. On the practical side, we show that it improves upon the performance of local SGD and related schemes without compromising accuracy.
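The following is a minimal, single-process sketch (not the authors' implementation) of the two techniques named in the abstract, applied to a toy least-squares problem: (a) multiple threads per worker applying lock-free, in-place SGD updates with a small minibatch to a shared parameter vector, and (b) a background thread that averages the workers' parameters in place without blocking the updaters. All names (local_sgd, averager, the hyperparameters) and the use of Python threads to stand in for shared-memory workers are illustrative assumptions; Python's GIL means the threads merely interleave rather than run in true parallel.

```python
# Single-process sketch of (a) Hogwild-style shared-memory asynchronous SGD with
# small minibatches inside each worker, and (b) non-blocking, in-place averaging
# that overlaps with the local updates. The toy objective and all names are
# illustrative assumptions, not the paper's code.
import threading
import numpy as np

DIM, N_SAMPLES, N_WORKERS = 10, 1024, 2
LOCAL_THREADS, STEPS, BATCH, LR, AVG_PERIOD = 2, 500, 8, 0.05, 0.01

rng = np.random.default_rng(0)
X = rng.standard_normal((N_SAMPLES, DIM))
w_true = rng.standard_normal(DIM)
y = X @ w_true + 0.01 * rng.standard_normal(N_SAMPLES)

# One shared parameter vector per worker; its threads update it in place, lock-free.
params = [np.zeros(DIM) for _ in range(N_WORKERS)]
done = threading.Event()

def local_sgd(worker_id):
    """Asynchronous local updates with a small minibatch (technique a)."""
    local_rng = np.random.default_rng(worker_id)
    for _ in range(STEPS):
        idx = local_rng.integers(0, N_SAMPLES, size=BATCH)
        w = params[worker_id]                       # shared vector, read without locking
        grad = X[idx].T @ (X[idx] @ w - y[idx]) / BATCH
        params[worker_id] -= LR * grad              # in-place, unsynchronized write

def averager():
    """Non-blocking in-place averaging across workers (technique b)."""
    while not done.is_set():
        mean = sum(params) / N_WORKERS              # snapshot may be slightly stale
        for w in params:
            w[:] = mean                             # overwrite in place, no barrier
        done.wait(AVG_PERIOD)

avg_thread = threading.Thread(target=averager, daemon=True)
avg_thread.start()
threads = [threading.Thread(target=local_sgd, args=(i,))
           for i in range(N_WORKERS) for _ in range(LOCAL_THREADS)]
for t in threads: t.start()
for t in threads: t.join()
done.set()
avg_thread.join()

print("distance to optimum:", np.linalg.norm(params[0] - w_true))
```

Because the averager overwrites each worker's parameter vector in place rather than waiting at a barrier, the updater threads keep stepping while averaging happens; this is the overlap the abstract describes. For the theoretical claim, the "classic sublinear rate" for non-convex SGD is typically stated, under standard smoothness and bounded-variance assumptions, as $\frac{1}{T}\sum_{t=1}^{T}\mathbb{E}\|\nabla f(x_t)\|^2 = O(1/\sqrt{T})$; the precise statement and constants for this method are in the reviewed PDF.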
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics
One-sentence Summary: A new variant of decentralized distributed SGD to train deep neural networks
Reviewed Version (pdf): https://openreview.net/references/pdf?id=CKCAio8mmI