A Theoretical Framework for Grokking: Interpolation followed by Riemannian Norm Minimisation

Published: 18 Sept 2025 · Last Modified: 29 Oct 2025 · NeurIPS 2025 poster · CC0 1.0
Keywords: grokking; non-convex optimisation; dynamic perturbed systems; theory of neural networks; Riemannian optimisation
TL;DR: We describe the training of overparametrised architectures with small weight decay as a two-phase dynamics: during the second phase, the trajectory follows a Riemannian gradient flow that minimises the parameter norm on the interpolation manifold.
Abstract: We study the dynamics of gradient flow with small weight decay on general training losses $F: \mathbb{R}^d \to \mathbb{R}$. Under mild regularity assumptions and assuming convergence of the unregularised gradient flow, we show that the trajectory with weight decay $\lambda$ exhibits a two-phase behaviour as $\lambda \to 0$. During the initial fast phase, the trajectory follows the unregularised gradient flow and converges to a manifold of critical points of $F$. Then, at time of order $1/\lambda$, the trajectory enters a slow drift phase and follows a Riemannian gradient flow minimising the $\ell_2$-norm of the parameters. This purely optimisation-based phenomenon offers a natural explanation for the \textit{grokking} effect observed in deep learning, where the training loss rapidly reaches zero while the test loss plateaus for an extended period before suddenly improving. We argue that this generalisation jump can be attributed to the slow norm reduction induced by weight decay, as explained by our analysis. We validate this mechanism empirically on several synthetic regression tasks.
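The mechanism can be made concrete on a toy problem. The following sketch is purely illustrative and is not the authors' code: it discretises gradient flow with small weight decay on an overparametrised linear regression, where the interpolation manifold $\{\theta : X\theta = y\}$ and the minimum-$\ell_2$-norm interpolator $X^+ y$ are known in closed form. The task, variable names (`X`, `theta`, `lam`, `eta`), and hyperparameters are assumptions chosen for illustration.

```python
# Illustrative sketch only (not the paper's experiments): explicit-Euler discretisation of
#   d theta / dt = -grad F(theta) - lambda * theta,   F(theta) = 0.5 * ||X theta - y||^2,
# on an overparametrised linear regression, where {theta : X theta = y} is the
# interpolation manifold and the minimum-l2-norm interpolator is X^+ y.
import numpy as np

rng = np.random.default_rng(0)
n, d = 20, 200                              # n samples, d parameters: heavily overparametrised
X = rng.normal(size=(n, d))
y = X @ (rng.normal(size=d) / np.sqrt(d))   # noiseless targets, so exact interpolation exists

lam = 1e-2                                  # small weight decay: the slow phase lives at t ~ 1/lam
eta = 2e-3                                  # Euler step size (safely below 2/L for this random X)
theta = rng.normal(size=d)                  # generic large-norm initialisation
min_norm = np.linalg.norm(np.linalg.pinv(X) @ y)

for step in range(250_000):                 # total time T = 500 >> 1/lam, well into the slow phase
    residual = X @ theta - y
    if step % 25_000 == 0 or step == 1_000:
        print(f"t={step * eta:7.1f}  train_loss={0.5 * residual @ residual:.2e}  "
              f"|theta|={np.linalg.norm(theta):7.3f}  (min-norm interpolator: {min_norm:.3f})")
    theta -= eta * (X.T @ residual + lam * theta)   # gradient of F plus weight-decay term
```

Under these assumptions, the expected behaviour matches the two-phase description above: the training loss collapses to near zero within a time of order one while $\|\theta\|_2$ is still large, and only on the slow timescale $t \sim 1/\lambda$ does the norm decay towards the minimum-norm value, mirroring the Riemannian norm-minimising flow on the interpolation manifold.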
Supplementary Material: zip
Primary Area: Optimization (e.g., convex and non-convex, stochastic, robust)
Submission Number: 9319