Leveraging Asynchronicity in Gradient Descent for Scalable Deep Learning

Jeff Daily, Abhinav Vishnu, Charles Siegel

Nov 04, 2016 (modified: Nov 04, 2016) ICLR 2017 conference submission readers: everyone
  • Abstract: In this paper, we present multiple approaches for improving the performance of gradient descent when utilizing mutiple compute resources. The proposed approaches span a solution space ranging from equivalence to running on a single compute device to delaying gradient updates a fixed number of times. We present a new approach, asynchronous layer-wise gradient descent that maximizes overlap of layer-wise backpropagation (computation) with gradient synchronization (communication). This approach provides maximal theoretical equivalence to the de facto gradient descent algorithm, requires limited asynchronicity across multiple iterations of gradient descent, theoretically improves overall speedup, while minimizing the additional space requirements for asynchronicity. We implement all of our proposed approaches using Caffe – a high performance Deep Learning library – and evaluate it on both an Intel Sandy Bridge cluster connected with InfiniBand as well as an NVIDIA DGX-1 connected with NVLink. The evaluations are performed on a set of well known workloads including AlexNet and GoogleNet on the ImageNet dataset. Our evaluation of these neural network topologies indicates asynchronous gradient descent has a speedup of up to 1.7x compared to synchronous.
  • TL;DR: Overlapping communication and computation for distributed gradient descent.
  • Keywords: Deep learning
  • Conflicts: wsu.edu, pnnl.gov