SGD with large step sizes learns sparse features

Published: 01 Feb 2023, Last Modified: 12 Mar 2024. Submitted to ICLR 2023.
Keywords: SGD, large step sizes, implicit regularization, feature learning
Abstract: We showcase important features of the dynamics of Stochastic Gradient Descent (SGD) in the training of neural networks. We present empirical observations that the commonly used large step sizes (i) lead the iterates to jump from one side of a valley to the other, causing \textit{loss stabilisation}, and (ii) this stabilisation induces a hidden stochastic dynamics, orthogonal to the bouncing directions, that \textit{implicitly biases} the iterates toward simple predictors. Furthermore, we show empirically that the longer large step sizes keep SGD high in the loss landscape valleys, the better the implicit regularization can operate and find sparse representations. Notably, no explicit regularization is used, so the regularization effect comes solely from the SGD training dynamics influenced by the step size schedule. These observations thus unveil how, through the step size schedule, both gradient and noise drive the SGD dynamics through the loss landscape of neural networks. We justify these findings theoretically through the study of simple neural network models. Finally, we shed new light on some common practices and observed phenomena when training neural networks.
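
To make the described mechanism concrete, below is a minimal sketch (not the authors' code): it trains a small ReLU network with SGD using a large step size followed by a small one, so that the loss-stabilisation plateau and the subsequent descent can be observed, and then inspects the first-layer column norms for sparsity. The model, data, and step-size values are illustrative assumptions, not taken from the paper.

```python
# Minimal sketch (not the authors' code): SGD on a small ReLU network with a
# large step size followed by a small one. The model, data, and step-size
# values below are illustrative assumptions, not taken from the paper.
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy regression data with a sparse ground truth: only 2 of 20 inputs matter.
X = torch.randn(512, 20)
y = X[:, 0] - X[:, 1] + 0.1 * torch.randn(512)

model = nn.Sequential(nn.Linear(20, 64), nn.ReLU(), nn.Linear(64, 1))
loss_fn = nn.MSELoss()

# Phase 1: large step size -- the iterates bounce across the valley and the
# training loss tends to stabilise at a plateau instead of decreasing further.
# Phase 2: small step size -- the iterates settle into the valley.
for step_size, n_steps in [(0.1, 2000), (0.005, 2000)]:
    opt = torch.optim.SGD(model.parameters(), lr=step_size)
    for step in range(n_steps):
        idx = torch.randint(0, X.shape[0], (32,))  # mini-batch sampling noise
        loss = loss_fn(model(X[idx]).squeeze(-1), y[idx])
        opt.zero_grad()
        loss.backward()
        opt.step()
        if step % 500 == 0:
            print(f"lr={step_size:.3f} step={step:4d} loss={loss.item():.4f}")

# Inspect which input coordinates the first layer uses: under the implicit
# regularisation described above, most of these column norms are expected to
# shrink relative to those of the two informative coordinates.
col_norms = model[0].weight.norm(dim=0)
print("first-layer input-column norms:", [round(v, 2) for v in col_norms.tolist()])
```

The two-phase schedule mimics the large-then-decayed step size schedules discussed in the abstract; in this toy setup the printed losses typically plateau during the large-step-size phase before dropping after the decay.
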
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics
Submission Guidelines: Yes
Please Choose The Closest Area That Your Submission Falls Into: Deep Learning and representational learning
TL;DR: Loss stabilization achieved via SGD with large step sizes leads to a hidden dynamics that promotes sparse feature learning
Supplementary Material: zip
Community Implementations: [1 code implementation (CatalyzeX)](https://www.catalyzex.com/paper/arxiv:2210.05337/code)