DenseStream: A Novel Data Representation for Gradient Sparsification in Distributed Synchronous SGD Algorithms

Published: 2023, Last Modified: 21 May 2026, IJCNN 2023, CC BY-SA 4.0
Abstract: Distributed training is widely used to train large-scale deep learning models, and data parallelism is one of the dominant approaches. Data-parallel training incurs additional communication overhead, which greatly slows training at low bandwidth. Gradient sparsification, which keeps a small number of important gradient values and sets the rest to zero, is a promising technique for reducing the communication volume. However, the communication of sparsified gradients suffers from scalability issues for two reasons: (1) the communication volume of the AllGather algorithm, which is commonly used to accumulate sparse gradients, increases linearly with the number of nodes, and (2) sparse local gradients may become dense again after gradient accumulation. These issues hinder the application of gradient sparsification. We observe that the distribution of sparse gradient values has strong locality, and we therefore propose DenseStream, a novel data representation for sparse gradients in data-parallel training, to alleviate these issues. DenseStream integrates an efficient sparse AllReduce algorithm with synchronous SGD (S-SGD). Evaluations are conducted on real-world applications. Experimental results show that DenseStream achieves a better compression ratio at higher densities and can represent sparse vectors with a wider range of densities. Compared with dense AllReduce, our method is more scalable and achieves a 3.1-12.1x improvement.
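The two scalability issues named in the abstract can be illustrated with a small simulation. The sketch below is not DenseStream and not the authors' code: it assumes plain top-k sparsification, independent Gaussian gradients (so it does not model the locality the authors observe), and a simplified element count for AllGather in which every node receives k indices and k values from each node. All function names and parameters are hypothetical.

```python
import random


def topk_sparsify(grad, k):
    """Keep the k largest-magnitude entries; return (sorted indices, values)."""
    idx = sorted(range(len(grad)), key=lambda i: abs(grad[i]), reverse=True)[:k]
    idx.sort()
    return idx, [grad[i] for i in idx]


def simulate(num_workers, n=1000, k=10, seed=0):
    """Return (elements received per node via AllGather, density after accumulation).

    Assumption: gradients are i.i.d. Gaussian, which is only a stand-in
    for real training gradients.
    """
    rng = random.Random(seed)
    touched = set()
    for _ in range(num_workers):
        grad = [rng.gauss(0.0, 1.0) for _ in range(n)]
        idx, _vals = topk_sparsify(grad, k)
        touched.update(idx)  # positions that are nonzero in some local gradient
    # Issue (1): each node receives k indices + k values from every node.
    allgather_elems = num_workers * 2 * k
    # Issue (2): the accumulated gradient is nonzero on the union of positions.
    accumulated_density = len(touched) / n
    return allgather_elems, accumulated_density


if __name__ == "__main__":
    for p in (1, 2, 4, 8, 16):
        elems, dens = simulate(p)
        print(f"workers={p:2d}  allgather_elems={elems:4d}  density={dens:.3f}")
```

Under these assumptions the received element count grows in proportion to the number of workers, and the density of the accumulated gradient rises above the local density k/n as workers select different positions.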