Efficient Communications in Training Large Scale Neural Networks

Linnan Wang, Wei Wu, George Bosilca, Richard Vuduc, Zenglin Xu

Nov 02, 2016 (modified: Dec 30, 2016) · ICLR 2017 conference submission · Readers: everyone
  • Abstract: We consider the problem of how to reduce the cost of communication that is required for the parallel training of a neural network. The state-of-the-art method, Bulk Synchronous Parallel Stochastic Gradient Descent (BSP-SGD), requires many collective communication operations, like broadcasts of parameters or reductions for sub-gradient aggregations, which for large messages quickly dominate overall execution time and limit parallel scalability. To address this problem, we develop a new technique for collective operations, referred to as Linear Pipelining (LP). It is tuned to the message sizes that arise in BSP-SGD, and works effectively on multi-GPU systems. Theoretically, the cost of LP is invariant to P, where P is the number of GPUs, while the cost of the more conventional Minimum Spanning Tree (MST) approach scales like O(log P). LP also demonstrates up to 2x higher bandwidth than the Bidirectional Exchange (BE) techniques that are widely adopted by current MPI implementations. We apply these collectives to BSP-SGD, showing that the proposed implementations reduce communication bottlenecks in practice while preserving the attractive convergence properties of BSP-SGD (the pipelined-collective idea is sketched below).
  • TL;DR: Reducing the communication cost of the parallel training of neural networks
  • Keywords: Applications, Deep learning
  • Conflicts: cs.cmu.edu, microsoft.com, gatech.edu, utk.edu
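
As a rough illustration of the linear-pipelining idea behind the LP collectives, the sketch below chunks a large message and forwards it hop-by-hop along a chain of ranks, so that for large messages the broadcast time is governed by the number of chunks rather than the number of participants P. This is a minimal host-side sketch assuming mpi4py and NumPy; the chunk size, the `pipelined_bcast` helper, and the use of CPU buffers are illustrative assumptions, not the authors' multi-GPU implementation.

```python
# Illustrative sketch only: linear-pipelined (chain) broadcast of a large buffer.
# Assumes mpi4py and NumPy; the paper's LP collectives move data GPU-to-GPU.
import numpy as np
from mpi4py import MPI


def pipelined_bcast(buf, chunk_elems, comm):
    """Broadcast `buf` from rank 0 along the chain 0 -> 1 -> ... -> P-1.

    Each rank receives chunk i from its predecessor and immediately forwards
    it to its successor, so successive chunks flow through the chain
    concurrently; for large messages the time is dominated by the number of
    chunks rather than by the number of ranks P.
    """
    rank, size = comm.Get_rank(), comm.Get_size()
    nchunks = (buf.size + chunk_elems - 1) // chunk_elems
    for i in range(nchunks):
        chunk = buf[i * chunk_elems:(i + 1) * chunk_elems]
        if rank > 0:                       # receive chunk i from predecessor
            comm.Recv(chunk, source=rank - 1, tag=i)
        if rank < size - 1:                # forward chunk i to successor
            comm.Send(chunk, dest=rank + 1, tag=i)


if __name__ == "__main__":
    comm = MPI.COMM_WORLD
    n = 1 << 20  # illustrative message size (e.g., a flattened parameter block)
    if comm.Get_rank() == 0:
        params = np.arange(n, dtype=np.float32)
    else:
        params = np.empty(n, dtype=np.float32)
    pipelined_bcast(params, chunk_elems=1 << 16, comm=comm)
```

Run with, e.g., `mpiexec -n 4 python lp_bcast.py`. The same chaining pattern, applied in the reverse direction with per-chunk accumulation, gives a pipelined reduction for the sub-gradient aggregation step of BSP-SGD.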