Abstract: In distributed training, the communication cost due to the transmission of gradients
or the parameters of the deep model is a major bottleneck in scaling up the number
of processing nodes. To address this issue, we propose dithered quantization for
the transmission of the stochastic gradients and show that training with Dithered
Quantized Stochastic Gradients (DQSG) is similar to the training with unquantized
SGs perturbed by an independent bounded uniform noise, in contrast to the other
quantization methods where the perturbation depends on the gradients and hence,
complicating the convergence analysis. We study the convergence of training
algorithms using DQSG and the trade off between the number of quantization
levels and the training time. Next, we observe that there is a correlation among the
SGs computed by workers that can be utilized to further reduce the communication
overhead without any performance loss. Hence, we develop a simple yet effective
quantization scheme, nested dithered quantized SG (NDQSG), that can reduce the
communication significantly without requiring the workers to communicate extra
information to each other. We prove that although NDQSG requires significantly
fewer bits, it can achieve the same quantization variance bound as DQSG. Our
simulation results confirm the effectiveness of training using DQSG and NDQSG
in reducing the communication bits or the convergence time compared to
existing methods, without sacrificing the accuracy of the trained model.
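The key mechanism behind DQSG is subtractive dithered quantization of each gradient coordinate. Below is a minimal NumPy sketch of this idea, assuming a uniform quantizer with step size `step` and a dither regenerated at the receiver from a shared random seed; the function names and parameters are illustrative, not taken from the paper.

```python
import numpy as np

def dithered_quantize(grad, step, seed):
    """Encoder: subtractive dithered quantization of a gradient vector (illustrative sketch)."""
    rng = np.random.default_rng(seed)                       # dither is reproducible from the shared seed
    dither = rng.uniform(-step / 2, step / 2, size=grad.shape)
    q_indices = np.round((grad + dither) / step).astype(np.int64)
    return q_indices                                        # only the integer indices are transmitted

def dithered_dequantize(q_indices, step, seed):
    """Decoder: regenerate the same dither from the seed and subtract it."""
    rng = np.random.default_rng(seed)
    dither = rng.uniform(-step / 2, step / 2, size=q_indices.shape)
    return q_indices * step - dither                        # error ~ Uniform(-step/2, step/2), independent of grad

# Usage on a synthetic "stochastic gradient"
g = np.random.randn(1000)
g_hat = dithered_dequantize(dithered_quantize(g, 0.1, seed=42), 0.1, seed=42)
err = g_hat - g                                             # empirically uniform on (-0.05, 0.05)
```

Because the dither spans exactly one quantization bin and is subtracted at the decoder, the reconstruction error is uniformly distributed and independent of the gradient, which is the property the abstract's convergence analysis relies on.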
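The abstract does not spell out the nested scheme, but nested (coset) quantization with decoder side information typically works as in the hedged sketch below: the worker dithers and fine-quantizes its gradient but transmits only the index modulo a small number of bins, and the server resolves the remaining ambiguity using a correlated gradient it already holds (e.g., another worker's decoded SG). The names `ndq_encode`, `ndq_decode`, `n_bins`, and `side_info` are illustrative assumptions, not the paper's notation.

```python
import numpy as np

def ndq_encode(grad, fine_step, n_bins, seed):
    """Encoder sketch: dither, fine-quantize, then keep only the index modulo n_bins."""
    rng = np.random.default_rng(seed)
    dither = rng.uniform(-fine_step / 2, fine_step / 2, size=grad.shape)
    fine_idx = np.round((grad + dither) / fine_step).astype(np.int64)
    return np.mod(fine_idx, n_bins)                         # log2(n_bins) bits per coordinate

def ndq_decode(coset_idx, side_info, fine_step, n_bins, seed):
    """Decoder sketch: use a correlated gradient (side_info) to pick the right coset member."""
    rng = np.random.default_rng(seed)
    dither = rng.uniform(-fine_step / 2, fine_step / 2, size=coset_idx.shape)
    target = (side_info + dither) / fine_step               # fine-grid coordinate of the side information
    k = np.round((target - coset_idx) / n_bins)             # coset shift closest to the side information
    fine_idx = coset_idx + k * n_bins
    return fine_idx * fine_step - dither
```

In this sketch, decoding recovers the fine-quantized value whenever the side information lies within roughly `n_bins * fine_step / 2` of the true coordinate, which is precisely where the correlation among the workers' SGs is exploited to save bits.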
Keywords: machine learning, distributed training, dithered quantization, nested quantization, distributed compression
TL;DR: The paper proposes and analyzes two quantization schemes for communicating Stochastic Gradients in distributed learning that reduce communication costs compared to the state of the art while maintaining the same accuracy.
Data: [MNIST](https://paperswithcode.com/dataset/mnist)