Asynchronous SGD without gradient delay for efficient distributed training


Sep 27, 2018 (modified: Oct 10, 2018) ICLR 2019 Conference Blind Submission readers: everyone
  • Abstract: Asynchronous distributed gradient descent algorithms for training deep neural networks are usually considered inefficient, mainly because of the gradient delay problem. In this paper, we propose a novel asynchronous distributed algorithm that tackles this limitation through carefully designed averaging of the model updates computed by workers. The algorithm allows workers to compute gradients while gradients are being merged, thus reducing or even completely eliminating worker idle time due to communication overhead, which is a pitfall of existing asynchronous methods. We provide a theoretical analysis of the proposed asynchronous algorithm and derive its regret bounds. According to our analysis, the crucial parameter for maintaining a high convergence rate is the maximal discrepancy between the local parameter vectors of any pair of workers. As long as this discrepancy is kept relatively small, the convergence rate of the algorithm is shown to be the same as that of sequential online learning. Furthermore, in our algorithm, this discrepancy is bounded by an expression that involves the staleness parameter of the algorithm and is independent of the number of workers. This is the main differentiator between our approach and other solutions, such as Elastic Asynchronous SGD or Downpour SGD, in which the maximal discrepancy is bounded by an expression that depends on the number of workers, due to the gradient delay problem. To demonstrate the effectiveness of our approach, we conduct a series of experiments on an image classification task on a cluster of 4 machines, connected by a commodity communication switch and equipped with a single GPU card per machine. Our experiments show linear scaling on the 4-machine cluster without sacrificing test accuracy, while almost completely eliminating worker idle time. Since our method allows the use of a commodity communication switch, it paves the way for large-scale distributed training on commodity clusters.
  • Keywords: SGD, distributed asynchronous training, deep learning, optimisation
  • TL;DR: A method for efficient asynchronous distributed training of deep learning models, along with theoretical regret bounds.
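
The quantity the abstract singles out, the maximal discrepancy between the local parameter vectors of any pair of workers, can be made concrete with a small toy simulation. The sketch below is not the algorithm proposed in the paper: it uses plain periodic model averaging on a quadratic objective, and every name in it (`max_discrepancy`, `average`, `simulate`, and the `staleness` argument, taken here to mean the number of local steps between merges) is an assumption chosen for illustration only.

```python
import random
from itertools import combinations


def max_discrepancy(params):
    """Largest Euclidean distance between any two workers' parameter vectors."""
    return max(
        sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5
        for p, q in combinations(params, 2)
    )


def average(params):
    """Coordinate-wise mean of the workers' parameter vectors."""
    n = len(params)
    return [sum(col) / n for col in zip(*params)]


def simulate(num_workers=4, dim=3, staleness=2, rounds=50, lr=0.1, seed=0):
    """Toy periodic model averaging on f(w) = 0.5 * ||w - target||^2.

    Assumption: each worker takes `staleness` noisy local SGD steps, then all
    workers replace their parameters with the average. Returns the final
    averaged parameters and the largest discrepancy seen just before a merge.
    """
    rng = random.Random(seed)
    target = [1.0] * dim
    params = [[0.0] * dim for _ in range(num_workers)]
    worst = 0.0
    for _ in range(rounds):
        for w in params:
            for _ in range(staleness):
                for i in range(dim):
                    # Exact gradient of the quadratic plus Gaussian noise.
                    grad = (w[i] - target[i]) + rng.gauss(0.0, 0.1)
                    w[i] -= lr * grad
        # Workers have drifted apart; record how far before merging.
        worst = max(worst, max_discrepancy(params))
        mean = average(params)
        params = [list(mean) for _ in range(num_workers)]
    return params[0], worst
```

Calling `simulate(num_workers=8)` and `simulate(num_workers=4)` and comparing the second returned value gives a rough feel for how the pre-merge discrepancy behaves in this toy setting; it says nothing about the bounds proved in the paper.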