Quantifying the Memorization-to-Generalization Transition: Scaling Laws and Phase Structure in Grokking

Published: 26 May 2026, Last Modified: 26 May 2026
Venue: ICML 2026 FoGen Workshop Poster
Readers: Everyone
License: CC BY 4.0
Keywords: grokking, scaling laws, phase transitions, memorization, generalization, implicit regularization, weight decay
TL;DR: We map the memorization-to-generalization boundary across 384 configurations, fitting a power-law scaling law for generalization onset and identifying a sharp phase boundary governed by weight decay.
Abstract: Neural networks trained past memorization frequently undergo a delayed transition to generalization, a phenomenon known as grokking. Despite theoretical progress on why this transition occurs, the quantitative structure of when it occurs in hyperparameter space remains uncharacterized. We map the memorization-to-generalization boundary across 384 configurations of two-hidden-layer MLPs on modular arithmetic, fitting a power-law scaling relation for generalization onset time: $T_{\mathrm{grok}} \propto H^{-0.27} D^{-2.04} \eta^{-0.50} \lambda^{-0.64}$ ($R^2 = 0.732$; $0.821$ with interactions). The exponent hierarchy reveals that data complexity ($D^{-2.04}$) is the dominant driver of regime transition, not model capacity ($H^{-0.27}$): doubling data accelerates generalization by ~4x, while doubling width yields only ~1.2x. A sharp phase boundary at weight decay $\lambda \gtrsim 1.0$ separates grokking from non-grokking configurations, and weight norm trajectories show monotonic compression during the transition, consistent with implicit regularization selecting low-complexity solutions. These results provide a quantitative foundation for predicting and controlling regime transitions in overparameterized networks.
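The doubling figures quoted in the abstract follow directly from the fitted exponents: if $T_{\mathrm{grok}} \propto x^{-a}$ with the other variables held fixed, multiplying $x$ by $s$ divides $T_{\mathrm{grok}}$ by $s^{a}$. A minimal sketch of that arithmetic is below. The exponents are taken from the abstract; the dictionary keys and the `speedup` helper are illustrative names, not part of the paper's code, and the calculation assumes the fit without interaction terms and a configuration that stays inside the grokking regime.

```python
# Exponents of the fitted scaling relation, as quoted in the abstract:
# T_grok ∝ H^-0.27 * D^-2.04 * eta^-0.50 * lambda^-0.64
# Key names are illustrative labels chosen for this sketch.
EXPONENTS = {
    "H": -0.27,       # hidden width
    "D": -2.04,       # data term
    "eta": -0.50,     # learning rate
    "lambda": -0.64,  # weight decay
}


def speedup(variable: str, scale: float = 2.0) -> float:
    """Factor by which T_grok shrinks when `variable` is multiplied by
    `scale`, holding every other variable fixed (no interaction terms)."""
    return scale ** (-EXPONENTS[variable])


if __name__ == "__main__":
    for name in EXPONENTS:
        print(f"doubling {name}: T_grok shrinks by a factor of {speedup(name):.2f}")
```

Under these assumptions, doubling $D$ gives a factor of $2^{2.04}$ (roughly 4) and doubling $H$ gives $2^{0.27}$ (roughly 1.2), matching the abstract. With $R^2 = 0.732$ for this fit, the factors are best read as average trends rather than per-configuration predictions.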
Submission Number: 172