Asynchronous SGD without gradient delay for efficient distributed training

27 Sep 2018 (modified: 21 Dec 2018) · ICLR 2019 Conference Blind Submission
  • Abstract: Asynchronous distributed gradient descent algorithms for training deep neural networks are usually considered inefficient, mainly because of the gradient delay problem. In this paper, we propose a novel asynchronous distributed algorithm that tackles this limitation through a careful averaging of the model updates computed by workers. The algorithm allows gradients to be computed while the gradient merge is in progress (see the sketch below), thus reducing, or even completely eliminating, the worker idle time caused by communication overhead, a common pitfall of existing asynchronous methods. We provide a theoretical analysis of the proposed asynchronous algorithm and derive its regret bounds. According to our analysis, the crucial quantity for maintaining a high convergence rate is the maximal discrepancy between the local parameter vectors of any pair of workers. As long as this discrepancy is kept relatively small, the convergence rate of the algorithm is shown to match that of sequential online learning. Furthermore, in our algorithm this discrepancy is bounded by an expression that involves the staleness parameter of the algorithm and is independent of the number of workers. This is the main differentiator between our approach and other solutions, such as Elastic Asynchronous SGD or Downpour SGD, in which the maximal discrepancy is bounded by an expression that depends on the number of workers, due to the gradient delay problem. To demonstrate the effectiveness of our approach, we conduct a series of image classification experiments on a cluster of 4 machines, equipped with a commodity communication switch and a single GPU card per machine. Our experiments show linear scaling on the 4-machine cluster without sacrificing test accuracy, while almost completely eliminating worker idle time. Since our method allows the use of a commodity communication switch, it paves the way for large-scale distributed training on commodity clusters.
  • Keywords: SGD, distributed asynchronous training, deep learning, optimisation
  • TL;DR: A method for efficient asynchronous distributed training of deep learning models, along with theoretical regret bounds.
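
As a rough illustration of the scheme the abstract describes, the sketch below simulates, in a single process, workers that keep taking local SGD steps while a background thread averages their models; the staleness bound TAU, the toy quadratic objective, and all names here are illustrative assumptions, not the paper's actual algorithm or code.

# Hypothetical sketch (not the authors' implementation): workers take local
# SGD steps while a background thread averages their models. TAU bounds how
# many local steps a worker may take between merges (the staleness bound),
# which is what keeps the inter-worker discrepancy small.
import threading
import time
import numpy as np

DIM, N_WORKERS, LR, TAU = 10, 4, 0.05, 8
models = [np.random.randn(DIM) for _ in range(N_WORKERS)]
locks = [threading.Lock() for _ in range(N_WORKERS)]
steps_since_merge = [0] * N_WORKERS
stop = threading.Event()

def stochastic_grad(w, rng):
    # Noisy gradient of the toy objective f(w) = 0.5 * ||w||^2.
    return w + 0.1 * rng.standard_normal(w.shape)

def worker(i):
    rng = np.random.default_rng(i)
    while not stop.is_set():
        if steps_since_merge[i] >= TAU:
            time.sleep(1e-3)  # staleness bound reached: wait for a merge
            continue
        g = stochastic_grad(models[i], rng)  # gradient work overlaps merging
        with locks[i]:
            models[i] = models[i] - LR * g
            steps_since_merge[i] += 1

def merger():
    # Periodic averaging bounds the discrepancy between any two local
    # models by a function of TAU, independently of the worker count.
    while not stop.is_set():
        for lock in locks:
            lock.acquire()
        avg = np.mean(models, axis=0)
        for i in range(N_WORKERS):
            models[i] = avg.copy()
            steps_since_merge[i] = 0
        for lock in locks:
            lock.release()
        time.sleep(0.01)

threads = [threading.Thread(target=worker, args=(i,)) for i in range(N_WORKERS)]
threads.append(threading.Thread(target=merger))
for t in threads:
    t.start()
time.sleep(1.0)
stop.set()
for t in threads:
    t.join()
print("final model norms:", [round(float(np.linalg.norm(m)), 4) for m in models])

In a real cluster the merge would run over the network while each GPU computes its next gradients, which is how the paper's approach hides communication time; the shared-memory threads above only stand in for that overlap.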