Abstract: Distributed training is an effective way to accelerate the training of large-scale deep neural network (DNN) models, but data communication between workers is often a bottleneck because parameters are exchanged at every iteration. To alleviate this problem, many communication-efficient distributed training algorithms based on synchronized stochastic gradient descent (SSGD), such as top-k SGD and TernGrad, reduce the amount of exchanged data by compressing or quantizing the gradients. However, these methods introduce extra intensive computation due to frequent tensor operations and memory accesses. In particular, as faster communication backends such as NCCL (the NVIDIA Collective Communications Library) or faster communication hardware become available, they no longer offer a time advantage. In this paper, we propose an efficient sparsification method, layer-based random SGD (LR-SGD), which randomly selects a certain number of layers of the DNN model to exchange, instead of selecting individual elements of each tensor, thereby reducing communication while keeping performance close to that of SSGD. Specifically, we use a hyperparameter k (the number of layers to be selected) to adjust the compression ratio, and two probabilistic models are used to select the layers to be exchanged. To validate the proposed method, we conduct experiments on different datasets with two DNN models of different scales on a simulated cluster. The results demonstrate that the layer-based random sparsification method effectively reduces communication overhead while maintaining high accuracy. Our code is available at https://github.com/prototype-zzy/LR-SGD.
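The core idea, selecting k whole layers per iteration rather than sparsifying elements within each tensor, can be illustrated with a minimal sketch. The abstract does not specify the two probabilistic selection models, so the uniform and size-weighted distributions below (and the layer sizes) are illustrative assumptions, not the paper's exact design.

```python
import numpy as np

# Hypothetical per-layer parameter counts for a small model (assumption).
layer_sizes = np.array([1_000, 50_000, 200_000, 200_000, 10_000])
num_layers = len(layer_sizes)
k = 2  # hyperparameter: number of layers whose gradients are exchanged

def select_layers(k, mode="uniform", rng=None):
    """Pick k distinct layer indices to exchange this iteration.

    'uniform' samples every layer with equal probability; 'size_weighted'
    samples layers in proportion to their parameter count. Both are
    illustrative choices, not necessarily the two models used in LR-SGD.
    """
    rng = rng or np.random.default_rng()
    if mode == "uniform":
        probs = np.full(num_layers, 1.0 / num_layers)
    else:
        probs = layer_sizes / layer_sizes.sum()
    return rng.choice(num_layers, size=k, replace=False, p=probs)

# One simulated iteration: only the selected layers' gradients would be
# all-reduced across workers; the remaining layers are updated locally.
selected = select_layers(k, mode="size_weighted")
comm_volume = layer_sizes[selected].sum()
print(f"exchange layers {sorted(selected)}: "
      f"{comm_volume}/{layer_sizes.sum()} parameters communicated")
```

Because entire layer tensors are either sent or skipped, the worker avoids the per-element masking and index bookkeeping that top-k style compressors require, which is the computational saving the abstract points to.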