Compressing gradients in distributed SGD by exploiting their temporal correlation

28 Sept 2020 (modified: 05 May 2023) · ICLR 2021 Conference Blind Submission · Readers: Everyone
Keywords: distributed optimization, gradient compression, error-feedback
Abstract: We propose SignXOR, a novel compression scheme that exploits the temporal correlation of gradients for gradient compression. Sign-based schemes such as Scaled-sign and SignSGD (Bernstein et al., 2018; Karimireddy et al., 2019) compress gradients by storing only the sign of each gradient entry. These methods, however, ignore temporal correlations between gradients. Whether the signs of gradient entries agree in two consecutive iterations can be represented by a binary vector, which can be further compressed depending on its entropy. By implementing a rate-distortion encoder we increase the temporal correlation of gradients, lowering the entropy and improving compression. We establish theoretical convergence of SignXOR by employing the two-way error-feedback approach introduced by Zheng et al. (2019). Zheng et al. (2019) show that two-way compression with error-feedback achieves the same asymptotic convergence rate as SGD, although convergence is slower by a constant factor. We strengthen their analysis to show that the rate of convergence of two-way compression with error-feedback is asymptotically the same as that of SGD. As a corollary, we prove that two-way SignXOR compression with error-feedback achieves the same asymptotic rate of convergence as SGD. We numerically evaluate our proposed method on the CIFAR-100 and ImageNet datasets and show that SignXOR requires less than 50% of the communication traffic of sending the signs of gradients. To the best of our knowledge, we are the first to present a gradient compression scheme that exploits the temporal correlation of gradients.
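
The following minimal sketch (in NumPy; not the authors' implementation) illustrates the core mechanism described in the abstract: each worker transmits the XOR of the current and previous gradient signs together with a scale, while an error-feedback buffer accumulates what compression discards. The class name SignXORCompressor and all helpers are hypothetical, and the paper's rate-distortion encoder is replaced here by a plain empirical-entropy estimate of the sign-change vector.

import numpy as np

class SignXORCompressor:
    """Hypothetical sketch of XOR-of-signs compression with error-feedback."""

    def __init__(self, dim):
        self.prev_sign = np.ones(dim)   # signs sent in the previous iteration
        self.error = np.zeros(dim)      # error-feedback memory

    def compress(self, grad):
        # Add the accumulated compression error before quantizing (error-feedback).
        corrected = grad + self.error
        scale = np.mean(np.abs(corrected))          # scaled-sign magnitude
        sign = np.where(corrected >= 0, 1.0, -1.0)
        xor_bits = sign != self.prev_sign           # True where the sign flipped
        self.error = corrected - scale * sign       # remember what was lost
        self.prev_sign = sign
        return xor_bits, scale

    @staticmethod
    def decompress(xor_bits, scale, prev_sign):
        # Receiver side: flip the previously reconstructed signs where a bit is set.
        sign = prev_sign * np.where(xor_bits, -1.0, 1.0)
        return scale * sign, sign                   # gradient estimate, new prev_sign

    @staticmethod
    def bits_per_entry(xor_bits):
        # Empirical binary entropy of the sign-change vector; it is low when
        # gradients are temporally correlated, which is what makes entropy coding pay off.
        p = np.clip(xor_bits.mean(), 1e-12, 1.0 - 1e-12)
        return float(-(p * np.log2(p) + (1.0 - p) * np.log2(1.0 - p)))

When consecutive gradients mostly keep their signs, the XOR vector is dominated by zeros and its entropy falls well below one bit per entry, so an entropy coder can push the traffic below that of transmitting the raw signs, which is the effect the abstract reports.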
One-sentence Summary: A novel gradient compression scheme that exploits the temporal correlation of gradients.
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics
Reviewed Version (pdf): https://openreview.net/references/pdf?id=OzX71iwclz