The large learning rate phase of deep learning

28 Sept 2020 (modified: 05 May 2023) · ICLR 2021 Conference Blind Submission
Keywords: deep learning, wide networks, training dynamics
Abstract: The choice of initial learning rate can have a profound effect on the performance of deep networks. We present empirical evidence that networks exhibit sharply distinct behaviors at small and large learning rates. In the small learning rate phase, training can be understood using the existing theory of infinitely wide neural networks. At large learning rates, we find that networks exhibit qualitatively distinct phenomena that cannot be explained by existing theory: The loss grows during the early part of training, and optimization eventually converges to a flatter minimum. Furthermore, we find that the optimal performance is often found in the large learning rate phase. To better understand this behavior we analyze the dynamics of a two-layer linear network and prove that it exhibits these different phases. We find good agreement between our analysis and the training dynamics observed in realistic deep learning settings.
One-sentence Summary: The loss grows early in training if the learning rate is large, and understanding this in full requires new theory.
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics
Supplementary Material: zip
Reviewed Version (pdf): https://openreview.net/references/pdf?id=GV8UiIPmEg
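
The two-phase behavior described in the abstract can be illustrated in a few lines. Below is a minimal NumPy sketch, not the authors' code: it runs gradient descent on a two-layer linear model f = u·v/√width for a single training example (x = 1, target 0) with squared loss. The specific setup (standard normal initialization, width 512, the learning rates 0.3 and 1.5, and the helper name train_two_layer_linear) is an assumption chosen so that the initial curvature λ = (|u|² + |v|²)/width is close to 2, which puts small learning rates in the monotone-decrease regime and learning rates between roughly 2/λ and 4/λ in the regime where the loss first grows and then converges.

```python
import numpy as np

def train_two_layer_linear(lr, width=512, steps=60, seed=0):
    """Gradient descent on the toy model f = u.v / sqrt(width) for a
    single example (x, y) = (1, 0), with squared loss L = f**2 / 2.
    This setup is an assumption chosen to exhibit the two phases."""
    rng = np.random.default_rng(seed)
    u = rng.normal(size=width)
    v = rng.normal(size=width)
    losses, sharpness = [], []
    for _ in range(steps):
        f = u @ v / np.sqrt(width)
        losses.append(0.5 * f**2)
        # lambda = (|u|^2 + |v|^2) / width is the top Hessian eigenvalue
        # at any minimum of this model, so smaller means flatter.
        sharpness.append((u @ u + v @ v) / width)
        grad_u = f * v / np.sqrt(width)  # dL/du
        grad_v = f * u / np.sqrt(width)  # dL/dv
        u, v = u - lr * grad_u, v - lr * grad_v
    return np.array(losses), np.array(sharpness)

# lr = 0.3, below 2/lambda_0: the loss decreases monotonically and the
# sharpness stays near its initial value (small learning rate phase).
# lr = 1.5, between 2/lambda_0 and 4/lambda_0: the loss grows during
# the first steps, then converges, with the final sharpness well below
# its initial value of ~2 (large learning rate phase, flatter minimum).
for lr in (0.3, 1.5):
    loss, lam = train_two_layer_linear(lr)
    print(f"lr={lr}: first losses {np.round(loss[:4], 3)}, "
          f"final loss {loss[-1]:.2e}, final sharpness {lam[-1]:.2f}")
```

Under these assumptions, the printed traces show the abstract's two claims side by side: early loss growth only in the large learning rate run, and a smaller final sharpness there than in the small learning rate run.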