Abstract: Synchronous Stochastic Gradient Descent (SGD) with data parallelism, the most popular parallel training strategy for deep learning, suffers from expensive gradient communications. Local SGD with periodic model averaging is a promising alternative to synchronous SGD. The algorithm allows each worker to locally update its own model, and periodically averages the model parameters across all the workers. While this algorithm enjoys less frequent communications, the convergence rate is strongly affected by the number of workers. In order to scale up the local SGD training without losing accuracy, the number of workers should be sufficiently small so that the model converges reasonably fast. In this paper, we discuss how to exploit the degree of parallelism in local SGD while maintaining model accuracy. Our training strategy employs multiple groups of processes and each group trains a local model based on data parallelism. The local models are periodically averaged across all the groups. Based on this hierarchical parallelism, we design a model averaging algorithm that has a cheaper communication cost than allreduce-based approach. We also propose a practical metric for finding the maximum number of workers that does not cause a significant accuracy loss. Our experimental results demonstrate that our proposed training strategy provides a significantly improved scalability while achieving a comparable model accuracy to synchronous SGD.
0 Replies
Loading