AdaScale SGD: A Scale-Invariant Algorithm for Distributed Training

Tyler B. Johnson; Pulkit Agrawal; Haijie Gu; Carlos Guestrin

25 Sept 2019 (modified: 05 May 2023); ICLR 2020 Conference Blind Submission; Readers: Everyone

TL;DR: A practical and principled algorithm for distributed SGD, which simplifies the process of scaling up training

Abstract: When using distributed training to speed up stochastic gradient descent, learning rates must adapt to new scales in order to maintain training effectiveness. Re-tuning these parameters is resource intensive, while fixed scaling rules often degrade model quality. We propose AdaScale SGD, a practical and principled algorithm that is approximately scale invariant. By continually adapting to the gradient’s variance, AdaScale often trains at a wide range of scales with nearly identical results. We describe this invariance formally through AdaScale’s convergence bounds. As the batch size increases, the bounds maintain final objective values, while smoothly transitioning away from linear speed-ups. In empirical comparisons, AdaScale trains well beyond the batch size limits of popular “linear learning rate scaling” rules. This includes large-scale training without model degradation for machine translation, image classification, object detection, and speech recognition tasks. The algorithm introduces negligible computational overhead and no tuning parameters, making AdaScale an attractive choice for large-scale training.
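To make the contrast with "linear learning rate scaling" concrete, here is a minimal sketch, not the authors' implementation. It assumes a gain ratio of the form (variance + squared gradient norm) / (variance / scale + squared gradient norm), which is how the full paper is generally understood to define AdaScale's learning-rate multiplier; the abstract itself does not state the formula. The function names and the scalar variance and gradient-norm inputs are hypothetical, and estimating those two quantities during training is out of scope here.

```python
def linear_scaling_lr(base_lr: float, scale: int) -> float:
    """Fixed rule: multiply the single-worker learning rate by the scale."""
    return base_lr * scale


def gain_ratio(grad_variance: float, grad_sqnorm: float, scale: int) -> float:
    """Assumed variance-aware gain (hypothetical helper, see lead-in).

    Approaches `scale` when gradient variance dominates and approaches 1
    when the variance is negligible relative to the squared gradient norm.
    """
    denominator = grad_variance / scale + grad_sqnorm
    if denominator == 0.0:
        # Both inputs are zero: no information, so apply no extra gain.
        return 1.0
    return (grad_variance + grad_sqnorm) / denominator


def variance_aware_lr(base_lr: float, grad_variance: float,
                      grad_sqnorm: float, scale: int) -> float:
    """Scale the base learning rate by the assumed gain ratio."""
    return base_lr * gain_ratio(grad_variance, grad_sqnorm, scale)


if __name__ == "__main__":
    base_lr, scale = 0.1, 8
    print("linear rule:", linear_scaling_lr(base_lr, scale))
    # Illustrative (variance, squared-norm) pairs, not measured values.
    for variance, sqnorm in [(1.0, 0.0), (1.0, 1.0), (0.0, 1.0)]:
        print(variance, sqnorm,
              variance_aware_lr(base_lr, variance, sqnorm, scale))
```

Under this assumption the multiplier always lies between 1 and the scale, which is one way to read the abstract's statement that the bounds move smoothly away from linear speed-ups as the batch size grows.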

Keywords: Large-batch SGD, large-scale learning, distributed training
