Squeezing SGD Parallelization Performance in Distributed Training Using Delayed Averaging

Published: 28 Jan 2022, Last Modified: 13 Feb 2023 · ICLR 2022 Submitted · Readers: Everyone
Keywords: SGD, distributed training, hide communication cost, convergence
Abstract: State-of-the-art deep learning algorithms rely on distributed training to tackle ever-growing model sizes and training datasets. Mini-batch Stochastic Gradient Descent (SGD) requires workers to halt forward/backward propagation and wait until gradients are synchronized among all workers before processing the next batch. This synchronous execution model exposes the overhead of gradient communication across the large number of workers in a distributed training system. To this end, we propose a new SGD algorithm with delayed averaging, namely DaSGD, which fully parallelizes SGD with forward/backward propagation to hide 100\% of the gradient communication. By adjusting the gradient update scheme, the algorithm uses hardware resources more efficiently and reduces the reliance on high-throughput interconnects. Both the theoretical analysis and the experimental results in this paper show that its convergence rate of $O(1/\sqrt{K})$ matches that of mini-batch SGD. An analytical model shows that it enables linear performance scalability with the cluster size.
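To make the delayed-averaging idea concrete, the toy NumPy sketch below simulates one possible update scheme: each worker applies its own local gradient immediately (so computation never stalls on communication) and, once the all-worker average for an earlier step "arrives" after a fixed delay, swaps the stale local gradient for that average. The quadratic objective, worker count W, delay D, and the exact correction rule are illustrative assumptions for this sketch, not the paper's DaSGD algorithm.

```python
# Illustrative simulation of delayed gradient averaging (not the paper's exact DaSGD update).
import numpy as np

rng = np.random.default_rng(0)
W, D, LR, STEPS, DIM = 4, 2, 0.05, 200, 10   # workers, delay, learning rate, steps, dimension
target = rng.normal(size=DIM)                # minimizer of the toy objective
w = np.zeros(DIM)                            # model replicated on all workers
pending = []                                 # (step, local gradients) awaiting the delayed average

def local_grad(w):
    """Noisy gradient of 0.5*||w - target||^2, standing in for one worker's mini-batch gradient."""
    return (w - target) + 0.1 * rng.normal(size=DIM)

for k in range(STEPS):
    grads = [local_grad(w) for _ in range(W)]   # each worker's local gradient at step k
    # Proceed immediately with the local gradient; no synchronization barrier here.
    w -= LR * grads[0]
    pending.append((k, grads))
    # D steps later the averaged gradient becomes available; apply a correction that
    # replaces the stale local gradient from step k - D with the all-worker average.
    if pending and pending[0][0] == k - D:
        _, old_grads = pending.pop(0)
        avg = np.mean(old_grads, axis=0)
        w -= LR * (avg - old_grads[0])

print("distance to optimum:", np.linalg.norm(w - target))
```

In a real cluster the "arrival" of the averaged gradient would correspond to an asynchronous all-reduce that overlaps with the next D forward/backward passes, which is how the gradient communication can be hidden behind computation.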