Does LLM Pre-Training Typically Occur at the Edge of Stability?

Published: 02 Mar 2026, Last Modified: 02 Mar 2026 · Sci4DL 2026 · CC BY 4.0
Keywords: Edge of Stability, Edge of Convexity, LLM pretraining, Quadratic Approximation
Abstract: Quadratic approximations are a common lens for analyzing neural network optimization, but recent evidence challenges their predictive validity. In full-batch gradient descent with learning rate (LR) $\eta$, Cohen et al. (2021) observed the Edge of Stability (EoS), where the largest Hessian eigenvalue concentrates near $2/\eta$, in tension with classical stability conditions. In this work, we revisit the fidelity of the quadratic approximation as a model of neural network training dynamics, with particular focus on its failure modes in LLM training. We first identify and decouple a distinct failure mechanism of the quadratic approximation, one that arises regardless of the LR from persistent negative curvature during training, which we term the *Edge of Convexity* (EoC). Having decoupled this mechanism from EoC, we then extend the definition of EoS to large-scale stochastic training with adaptive optimizers. Across LLM pretraining runs with model sizes up to $1.7$B parameters, we find: (1) EoC is observed throughout LLM pretraining. (2) EoS is prevalent but not universal; it disappears when the LR becomes sufficiently small (e.g., after decay) or when the batch size falls below a threshold that scales linearly with the critical batch size. Together, these findings characterize when and how quadratic approximations fail and lay a foundation for future work on understanding the training dynamics of modern neural networks.
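To make the EoS criterion concrete, the sketch below (ours, not from the submission) shows one standard way to estimate the sharpness it refers to, the largest Hessian eigenvalue that EoS compares against $2/\eta$, via power iteration on Hessian-vector products. It assumes PyTorch, a differentiable scalar `loss`, and a list of parameter tensors `params`; the function name and iteration count are illustrative.

```python
import torch

def top_hessian_eigenvalue(loss, params, iters=20):
    """Estimate the largest Hessian eigenvalue of `loss` w.r.t. `params`
    by power iteration on Hessian-vector products (Pearlmutter's trick)."""
    # Gradients with create_graph=True so we can differentiate through them.
    grads = torch.autograd.grad(loss, params, create_graph=True)
    v = [torch.randn_like(p) for p in params]
    for _ in range(iters):
        # Hessian-vector product: differentiate (grad . v) w.r.t. params.
        gv = sum((g * vi).sum() for g, vi in zip(grads, v))
        hv = torch.autograd.grad(gv, params, retain_graph=True)
        # Normalize so v tracks the top eigendirection.
        norm = torch.sqrt(sum((h ** 2).sum() for h in hv))
        v = [h / norm for h in hv]
    # Rayleigh quotient v^T H v at the (unit-norm) converged direction.
    gv = sum((g * vi).sum() for g, vi in zip(grads, v))
    hv = torch.autograd.grad(gv, params, retain_graph=True)
    return sum((h * vi).sum() for h, vi in zip(hv, v)).item()
```

Under full-batch gradient descent, EoS corresponds to this estimate hovering near $2/\eta$; the paper's contribution is to ask how this picture transfers to stochastic, adaptive LLM pretraining.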
Anonymization: This submission has been anonymized for double-blind review via the removal of identifying information such as names, affiliations, and identifying URLs.
Style Files: I have used the style files.
Submission Number: 52