- TL;DR: A practical and principled algorithm for distributed SGD, which simplifies the process of scaling up training
- Abstract: When using distributed training to speed up stochastic gradient descent, learning rates must adapt to new scales in order to maintain training effectiveness. Re-tuning these parameters is resource-intensive, while fixed scaling rules often degrade model quality. We propose AdaScale SGD, a practical and principled algorithm that is approximately scale invariant. By continually adapting to the gradient’s variance, AdaScale often trains at a wide range of scales with nearly identical results. We describe this invariance formally through AdaScale’s convergence bounds. As the batch size increases, the bounds maintain final objective values, while smoothly transitioning away from linear speed-ups. In empirical comparisons, AdaScale trains well beyond the batch size limits of popular “linear learning rate scaling” rules. This includes large-scale training without model degradation for machine translation, image classification, object detection, and speech recognition tasks. The algorithm introduces negligible computational overhead and no tuning parameters, making AdaScale an attractive choice for large-scale training.
- Keywords: Large-batch SGD, large-scale learning, distributed training
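
The contrast drawn in the abstract, between a fixed "linear learning rate scaling" rule and a rate that adapts to the gradient's variance, can be illustrated with a minimal sketch. The abstract does not state AdaScale's update rule, so the `gain_ratio` formula below is an assumption about how a variance-aware gain could be computed, and all function and parameter names are hypothetical.

```python
def linear_scaling_lr(base_lr: float, scale: int) -> float:
    """Fixed rule: multiply the single-worker learning rate by the scale."""
    return base_lr * scale


def gain_ratio(grad_var: float, grad_sqnorm: float, scale: int) -> float:
    """Assumed variance-aware gain (not taken from the abstract).

    grad_var    : estimated variance of the per-batch gradient
    grad_sqnorm : estimated squared norm of the true (mean) gradient
    scale       : number of batches averaged per step

    Averaging `scale` batches divides the variance term by `scale`,
    so the ratio lies between 1 and `scale`.
    """
    if grad_var < 0 or grad_sqnorm < 0 or scale < 1:
        raise ValueError("variance and squared norm must be >= 0, scale >= 1")
    denom = grad_var / scale + grad_sqnorm
    if denom == 0.0:
        return 1.0  # no gradient signal at all; fall back to no gain
    return (grad_var + grad_sqnorm) / denom


def adaptive_lr(base_lr: float, grad_var: float,
                grad_sqnorm: float, scale: int) -> float:
    """Learning rate scaled by the assumed gain instead of by `scale`."""
    return base_lr * gain_ratio(grad_var, grad_sqnorm, scale)


if __name__ == "__main__":
    # Noise-dominated gradients: the gain is close to the full scale.
    print(adaptive_lr(0.1, grad_var=100.0, grad_sqnorm=1.0, scale=8))
    # Signal-dominated gradients: the gain stays close to 1.
    print(adaptive_lr(0.1, grad_var=0.01, grad_sqnorm=1.0, scale=8))
```

Under this assumed formula, the gain matches the linear rule when gradient noise dominates and shrinks toward 1 when it does not, which is one way to picture the "smooth transition away from linear speed-ups" that the abstract describes.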