Keywords: SGD, learning rate, batch size, optimization, generalization, implicit bias, implicit regularization, sharpness, scaling rule
TL;DR: We find that SGD implicitly regularizes the interaction between the gradient distribution and the geometry of the loss landscape, and we propose a more accurate scaling rule between batch size and learning rate.
Abstract: We study the unstable dynamics of stochastic gradient descent (SGD) and their impact on generalization in neural networks. We find that SGD implicitly regularizes the interaction between the gradient distribution and the geometry of the loss landscape. Moreover, based on an analysis of a concentration measure of the batch gradient, we propose a more accurate scaling rule between batch size and learning rate, the Linear and Saturation Scaling Rule (LSSR).
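To make the "linear and saturation" idea concrete, the minimal sketch below shows a generic mapping in which the learning rate grows linearly with batch size and then stops growing past a threshold. The functional form, the `base_batch` and `saturation_batch` parameters, and all numeric values are illustrative assumptions for exposition, not the paper's actual LSSR formula.

```python
def lssr_learning_rate(batch_size: int,
                       base_lr: float = 0.1,
                       base_batch: int = 256,
                       saturation_batch: int = 4096) -> float:
    """Hypothetical linear-then-saturating batch-size-to-learning-rate mapping.

    For small batches the learning rate scales linearly with batch size
    (as in the classic linear scaling rule); beyond `saturation_batch`
    the learning rate saturates at a constant value. Parameter names and
    the exact form are assumptions, not taken from the paper.
    """
    linear_lr = base_lr * batch_size / base_batch
    saturated_lr = base_lr * saturation_batch / base_batch
    return min(linear_lr, saturated_lr)


if __name__ == "__main__":
    # Learning rate grows linearly up to the saturation batch size, then plateaus.
    for b in [64, 256, 1024, 4096, 16384]:
        print(f"batch size {b:6d} -> learning rate {lssr_learning_rate(b):.3f}")
```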
Supplementary Material: zip