Noise-driven escape from metastable phases explains grokking in deep neural networks

Ibrahim Talha Ersoy; Karoline Wiesner

Published: 29 May 2026, Last Modified: 15 Jun 2026

Venue: HiLD at ICML 2026 Poster

License: CC BY 4.0

Keywords: grokking, phase transitions, activated escape, loss landscape, L2 regularization, optimization dynamics

TL;DR: We show that grokking arises from noise-activated escape out of metastable states created by L2 phase transitions, with escape times obeying Arrhenius scaling.

Abstract: Deep neural networks (DNNs) exhibit first-order phase transitions under variations of the L2 regularisation strength, with each transition marking the onset of a new learnable feature. Below a critical regularisation strength, all features are in principle learnable, but coexisting metastable states, separated by energy barriers, can trap the network and impede convergence. DNNs are valued for their ability to generalise, yet many open questions remain, among them the origin of so-called grokking: the abrupt, delayed onset of generalisation after prolonged apparent overfitting. We show for linear DNNs that grokking is consistent with hysteresis in first-order L2 phase transitions: using L2 regularisation to engineer deliberate trapping, we demonstrate that a model in a low-accuracy metastable state escapes only when SGD noise drives it across an energy barrier, with escape times following Arrhenius scaling. With this trapping protocol we reproduce grokking-like delayed convergence across two orders of magnitude in escape time. Using sparse sub-sampling, we also reproduce the canonical grokking curve, in which test error eventually approaches the final training error. Our work suggests that, because the number of metastable states equals the number of learnable features (one per singular value of the data covariance), the potential for hysteresis grows naturally with task complexity. We provide evidence that the same mechanism likely operates in general nonlinear DNNs. Our results suggest routes toward more efficient learning schemes.
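As a rough illustration of what Arrhenius scaling of escape times means, the sketch below uses the textbook form tau = tau_0 * exp(dE / T) and recovers the barrier height dE from the slope of log(tau) against 1/T. It is not the authors' analysis: the function names, the effective noise temperature `T` standing in for SGD noise strength, and all numerical values are assumptions made for illustration only.

```python
import numpy as np


def arrhenius_time(barrier, temperature, prefactor=1.0):
    """Textbook Arrhenius law: tau = prefactor * exp(barrier / temperature)."""
    return prefactor * np.exp(barrier / np.asarray(temperature, dtype=float))


def fit_barrier(temperatures, escape_times):
    """Least-squares fit of log(tau) = log(prefactor) + barrier * (1 / T).

    Returns (barrier, prefactor) estimated from the slope and intercept.
    """
    inv_t = 1.0 / np.asarray(temperatures, dtype=float)
    log_tau = np.log(np.asarray(escape_times, dtype=float))
    slope, intercept = np.polyfit(inv_t, log_tau, 1)
    return slope, np.exp(intercept)


# Hypothetical noise levels (assumed values, not taken from the paper).
temperatures = np.array([0.5, 0.4, 0.3, 0.25, 0.2, 0.15])

# Synthetic escape times generated from the law itself, with an assumed
# barrier of 1.0 and an assumed prefactor of 10.0.
taus = arrhenius_time(barrier=1.0, temperature=temperatures, prefactor=10.0)

barrier_hat, prefactor_hat = fit_barrier(temperatures, taus)
print(f"fitted barrier: {barrier_hat:.3f}, fitted prefactor: {prefactor_hat:.3f}")
```

With measured escape times in place of the synthetic `taus`, a straight line in the log(tau) versus 1/T plot would be the signature of activated escape, and its slope would estimate the barrier height in units of the chosen noise temperature.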

Submission Number: 55
