Combining Global Sparse Gradients with Local Gradients

27 Sept 2018 (modified: 05 May 2023) · ICLR 2019 Conference Withdrawn Submission
Abstract: Data-parallel neural network training is network-intensive, so gradient dropping was designed to exchange only large gradients. However, gradient dropping has been shown to slow convergence. We propose to improve convergence by having each node combine its locally computed gradient with the sparse global gradient exchanged over the network. We empirically confirm on machine translation tasks that gradient dropping with local gradients reaches convergence 48% faster than non-compressed multi-node training and 28% faster than vanilla gradient dropping. We also show that gradient dropping with a local gradient update does not reduce the model's final quality.
Keywords: Distributed training, stochastic gradient descent, machine translation
TL;DR: We improve gradient dropping (a technique that exchanges only large gradients in distributed training) by incorporating local gradients into the parameter update, reducing quality loss and further improving training time.
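
The following is a minimal sketch of the idea described in the abstract, assuming a NumPy setting. The `all_reduce` helper, the drop ratio, and the exact way the local gradient is folded into the update are illustrative assumptions, not the authors' implementation.

```python
# Sketch of gradient dropping with a local-gradient update (illustrative only).
# `all_reduce` is a hypothetical stand-in for the cross-node sparse exchange.
import numpy as np

def sparsify(grad, drop_ratio=0.99):
    """Keep only the largest-magnitude entries; everything else becomes residual."""
    threshold = np.quantile(np.abs(grad), drop_ratio)
    mask = np.abs(grad) >= threshold
    sparse = np.where(mask, grad, 0.0)
    residual = grad - sparse  # dropped entries accumulate locally for later steps
    return sparse, residual

def local_update_step(params, local_grad, residual, all_reduce, lr=0.1, num_nodes=4):
    # Fold in the residual carried over from earlier steps, then drop small entries.
    sparse, residual = sparsify(local_grad + residual)
    # Only the sparse gradient crosses the network (summed over nodes).
    global_sparse = all_reduce(sparse)
    # Combine the averaged global sparse gradient with the part of the dense
    # local gradient that was not exchanged, so the update is not limited to
    # the large entries alone.
    combined = global_sparse / num_nodes + (local_grad - sparse)
    params = params - lr * combined
    return params, residual

# Single-node smoke test: all_reduce is the identity when there is one worker.
if __name__ == "__main__":
    rng = np.random.default_rng(0)
    params = rng.normal(size=1000)
    residual = np.zeros_like(params)
    grad = rng.normal(size=1000)
    params, residual = local_update_step(params, grad, residual,
                                         all_reduce=lambda x: x, num_nodes=1)
```

Accumulating the dropped residual locally is the standard gradient-dropping mechanism; the abstract's contribution corresponds to the extra local term in the combined update, which here is represented by the `(local_grad - sparse)` component.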