Surge Phenomenon in Optimal Learning Rate and Batch Size Scaling

Shuaipeng Li; Penghao Zhao; Hailin Zhang; Samm Sun; Hao Wu; Dian Jiao; Weiyan Wang; Chengjun Liu; Zheng Fang; Jinbao Xue; Yangyu Tao; Bin CUI; Di Wang

Published: 25 Sept 2024, Last Modified: 06 Nov 2024; NeurIPS 2024 poster; CC BY 4.0

Keywords: Optimal learning rate, Scaling law

TL;DR: We theoretically correct the relationship between the optimal learning rate and batch size for models trained with Adam-style optimizers.

Abstract: In current deep learning tasks, Adam-style optimizers (such as Adam, Adagrad, RMSprop, Adafactor, and Lion) have been widely used as alternatives to SGD-style optimizers. These optimizers typically update model parameters using the sign of gradients, resulting in more stable convergence curves. The learning rate and the batch size are the most critical hyperparameters for optimizers, and both require careful tuning to enable effective convergence. Previous research has shown that the optimal learning rate increases linearly (or follows similar rules) with batch size for SGD-style optimizers. However, this conclusion does not apply to Adam-style optimizers. In this paper, we elucidate the connection between optimal learning rates and batch sizes for Adam-style optimizers through both theoretical analysis and extensive experiments. First, we derive the scaling law between batch sizes and optimal learning rates in the “sign of gradient” case, in which we prove that the optimal learning rate first rises and then falls as the batch size increases. Moreover, the peak of this surge gradually moves toward larger batch sizes as training progresses. Second, we conduct experiments on various CV and NLP tasks and verify the correctness of the scaling law.
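The rise-then-fall behaviour described in the abstract can be illustrated with a small toy calculation. The sketch below is not the paper's derivation. It assumes a quadratic loss, a sign-of-gradient update, and Gaussian mini-batch noise that is independent across coordinates with standard deviation sigma / sqrt(B). Under those assumptions the expected one-step loss change is -lr * linear + 0.5 * lr^2 * quad, so the best learning rate is linear / quad. All names and numbers (`expected_sign`, `optimal_lr`, the two-parameter gradient and Hessian) are hypothetical and chosen only for illustration.

```python
import math


def expected_sign(g, sigma, batch_size):
    """E[sign(g_hat)] for g_hat ~ Normal(g, sigma^2 / batch_size)."""
    return math.erf(g * math.sqrt(batch_size) / (math.sqrt(2.0) * sigma))


def optimal_lr(grad, noise_std, hessian, batch_size):
    """Learning rate minimizing the expected one-step change of a quadratic
    loss under the update theta <- theta - lr * sign(g_hat).

    Assumption: noise is independent across coordinates, so
    E[s_i * s_j] = m_i * m_j for i != j, and E[s_i * s_i] = 1.
    """
    m = [expected_sign(g, s, batch_size) for g, s in zip(grad, noise_std)]
    # First-order term: sum_i g_i * E[s_i]
    linear = sum(g * mi for g, mi in zip(grad, m))
    # Second-order term: sum_ij H_ij * E[s_i * s_j]
    quad = 0.0
    for i in range(len(grad)):
        for j in range(len(grad)):
            e_ss = 1.0 if i == j else m[i] * m[j]
            quad += hessian[i][j] * e_ss
    return linear / quad


# Hypothetical toy problem: one large and one small gradient component,
# coupled through a positive off-diagonal Hessian entry.
grad = [1.0, 0.01]
noise_std = [1.0, 1.0]
hessian = [[1.0, 0.9], [0.9, 1.0]]

batch_sizes = [4 ** k for k in range(11)]  # 1, 4, 16, ..., 4**10
lrs = [optimal_lr(grad, noise_std, hessian, b) for b in batch_sizes]

for b, lr in zip(batch_sizes, lrs):
    print(f"B={b:>8d}  optimal lr={lr:.4f}")
```

In this toy setting the large gradient component saturates at small batch sizes, which raises the optimal learning rate. The small component keeps adding curvature cost as its sign becomes reliable at larger batch sizes, which pulls the optimum back down.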

Primary Area: Learning theory

Submission Number: 4289
