On the Synergy Between Label Noise and Learning Rate Annealing in Neural Network Training

Published: 26 Oct 2023 · Last Modified: 13 Dec 2023 · NeurIPS 2023 Workshop Poster
Keywords: Deep learning theory, non-convex optimization
Abstract: In the past decade, stochastic gradient descent (SGD) has emerged as one of the most dominant algorithms in neural network training, with enormous success across application scenarios. However, the implicit bias of SGD under different training techniques remains under-explored. Two common heuristics in practice are 1) using a large initial learning rate and decaying it as training progresses, and 2) using mini-batch SGD instead of full-batch gradient descent. In this work, we show that under certain data distributions, both techniques are necessary for neural networks to generalize well. We consider mini-batch SGD with label noise, and at the heart of our analysis lies the concept of feature learning order, which has previously been characterized theoretically by Li et al. (2019) and Abbe et al. (2021). Notably, we use this to give the first concrete separations in generalization guarantees between training neural networks with both label-noise SGD and learning rate annealing and training with either of these elements removed.
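The two heuristics the abstract names can be sketched together in a few lines. The snippet below is a minimal illustration, not the paper's actual setting: the data distribution, model (a linear classifier rather than a neural network), step-decay schedule, and all hyperparameters (`lr0`, `noise_p`, batch size) are assumptions chosen only to make the mechanics concrete.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical synthetic data: linearly separable binary labels in {-1, +1}.
n, d = 200, 5
X = rng.normal(size=(n, d))
w_true = rng.normal(size=d)
y = np.sign(X @ w_true)

def annealed_lr(step, lr0=0.5, decay_every=100, factor=0.1):
    """Heuristic 1: start with a large learning rate, decay it in steps."""
    return lr0 * factor ** (step // decay_every)

w = np.zeros(d)
batch_size, noise_p = 16, 0.1
for step in range(300):
    # Heuristic 2: mini-batch SGD rather than full-batch gradient descent.
    idx = rng.choice(n, size=batch_size, replace=False)
    yb = y[idx].copy()
    # Label noise: independently flip each batch label with probability noise_p.
    flips = rng.random(batch_size) < noise_p
    yb[flips] *= -1
    # Gradient of the logistic loss on the noisy mini-batch.
    margins = yb * (X[idx] @ w)
    grad = -(X[idx] * (yb / (1 + np.exp(margins)))[:, None]).mean(axis=0)
    w -= annealed_lr(step) * grad

# Evaluate on the clean labels.
train_acc = (np.sign(X @ w) == y).mean()
```

Removing either element corresponds to setting `noise_p = 0` (no label noise) or making `annealed_lr` return a constant (no annealing); the paper's separation result says that, on its constructed distributions, neither ablation generalizes as well as the combination.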
Submission Number: 106