Good regularity creates large learning rate implicit biases: edge of stability, balancing, and catapult

Published: 07 Nov 2023, Last Modified: 13 Dec 2023
M3L 2023 Poster
Keywords: regularity, large learning rate, implicit bias, edge of stability, balancing, catapult, gradient descent, convergence
TL;DR: Good regularity of the objective function, in combination with a large learning rate, creates implicit biases in gradient descent, including edge of stability, balancing, and catapult.
Abstract: Large learning rates, when applied to gradient descent for nonconvex optimization, yield various implicit biases, including edge of stability (Cohen et al., 2021), balancing (Wang et al., 2022), and catapult (Lewkowycz et al., 2020). These phenomena cannot be well explained by classical optimization theory. Significant theoretical progress has been made toward understanding these implicit biases, but it remains unclear for which objective functions they occur. This paper provides an initial step in answering this question, showing that these implicit biases are different tips of the same iceberg. Specifically, they occur when the optimization objective function has certain regularity. This regularity, together with gradient descent using a large learning rate that favors flatter regions, results in these nontrivial dynamical behaviors. To demonstrate this claim, we develop new global convergence theory under large learning rates for two examples of nonconvex functions without global smoothness, departing from typical assumptions in traditional analyses. We also discuss the implications for training neural networks, where different losses and activations can affect regularity and lead to highly varied training dynamics.
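The large-learning-rate effects named above can be seen in a minimal numerical sketch (not from the paper; it uses the two-variable factorization objective f(x, y) = (xy - 1)^2 / 2 from the balancing literature, and all learning-rate and initialization choices here are illustrative assumptions). Gradient flow on this objective conserves x^2 - y^2, so a small learning rate leaves the factors as unbalanced as they started; a large learning rate cannot settle at minima sharper than 2/lr, so the iterates oscillate and drift toward flatter, more balanced minima:

```python
# Sketch (assumed setup, not the paper's experiment): gradient descent on
# f(x, y) = (x*y - 1)^2 / 2, a nonconvex objective without global smoothness.
def gd(lr, steps=2000, x=2.0, y=0.4):
    for _ in range(steps):
        r = x * y - 1.0                      # residual; loss is r**2 / 2
        x, y = x - lr * r * y, y - lr * r * x
    return x, y

# At a global minimum (x*y = 1) the sharpness (top Hessian eigenvalue)
# is x**2 + y**2; gradient descent is stable only if lr < 2 / sharpness.
x_s, y_s = gd(lr=0.01)  # small lr: settles at a nearby, unbalanced minimum
x_l, y_l = gd(lr=0.5)   # large lr: initial minimum is too sharp (> 2/lr),
                        # so the iterates move toward a flatter minimum

print(f"small lr: sharpness={x_s**2 + y_s**2:.3f}, imbalance={abs(x_s**2 - y_s**2):.3f}")
print(f"large lr: sharpness={x_l**2 + y_l**2:.3f}, imbalance={abs(x_l**2 - y_l**2):.3f}")
```

Under these assumed settings, the small-learning-rate run ends at a minimum with sharpness above 2/lr = 4 for the large rate, while the large-learning-rate run ends below that threshold with a smaller imbalance |x^2 - y^2|, matching the "large learning rate favors flatter regions" mechanism the abstract describes.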
Submission Number: 39