Grokking and Generalization Collapse: Insights from HTSR theory

Published: 09 Jun 2025, Last Modified: 09 Jun 2025 · HiLD at ICML 2025 Poster · CC BY 4.0
Keywords: Grokking, Heavy-Tailed Self-Regularization, Random Matrix Theory, Heavy-Tail Exponent, Spectral Analysis, Generalization Dynamics, Catastrophic Generalization Collapse, Implicit Regularization
TL;DR: Discovering a new phase of grokking after 10 million training steps through a novel application of HTSR and RMT
Abstract: We study the well-known grokking phenomenon in neural networks (NNs) using a 3-layer MLP trained on a 1k-sample subset of MNIST, with and without weight decay, and discover a novel third phase, **anti-grokking**, that occurs very late in training: test accuracy collapses while training accuracy stays perfect. This late-stage collapse superficially resembles the familiar **pre-grokking** phase but is distinct from both the known **pre-grokking** and **grokking** phases, and it is not detected by other proposed grokking progress measures. Leveraging Heavy-Tailed Self-Regularization (**HTSR**) through the open-source WeightWatcher tool, we show that the **HTSR** layer quality metric $\alpha$ alone delineates **all three** phases, whereas the best competing metrics detect only the first two. The anti-grokking phase is revealed by training for $10^{7}$ steps and is invariably heralded by $\alpha<2$ and the appearance of **Correlation Traps**: outlier singular values in the randomized layer weight matrices that make the layer weight matrix **atypical** and signal overfitting of the training set. Such traps are verified by visual inspection of the layer-wise empirical spectral densities and by Kolmogorov–Smirnov tests on the randomized spectra. Comparative metrics, including activation sparsity, absolute weight entropy, circuit complexity, and $l^{2}$ weight norms, track pre-grokking and grokking but fail to distinguish grokking from anti-grokking. This discovery provides a way to measure overfitting and generalization collapse without direct access to the test data. These results strengthen the claim that **HTSR** provides a universal layer-convergence target at $\alpha\approx 2$ and underscore the value of the **HTSR** $\alpha$ metric as a measure of generalization.
Student Paper: Yes
Submission Number: 71
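
The two diagnostics highlighted in the abstract can be sketched in a few lines of Python. The snippet below is a minimal, hypothetical illustration rather than the authors' code: it assumes a small PyTorch MLP as a stand-in for the paper's 3-layer model, uses the public `weightwatcher` package's `WeightWatcher(model=...).analyze()` call to obtain the per-layer HTSR $\alpha$, and replaces the paper's Kolmogorov–Smirnov procedure with a simple bulk-edge heuristic for spotting correlation traps in an element-wise randomized weight matrix.

```python
import numpy as np
import torch.nn as nn
import weightwatcher as ww

# Stand-in for the paper's 3-layer MLP trained on a 1k-sample MNIST subset
# (layer widths here are illustrative, not taken from the paper).
model = nn.Sequential(
    nn.Linear(784, 256), nn.ReLU(),
    nn.Linear(256, 256), nn.ReLU(),
    nn.Linear(256, 10),
)

# HTSR layer quality metric alpha: values near 2 mark a well-converged layer,
# while alpha < 2 heralds the anti-grokking (generalization-collapse) phase.
watcher = ww.WeightWatcher(model=model)
details = watcher.analyze()  # per-layer spectral metrics as a pandas DataFrame
for idx, row in details.iterrows():
    status = "alpha < 2: possible overfitting" if row["alpha"] < 2.0 else "ok"
    print(f"layer {idx}: alpha = {row['alpha']:.3f} ({status})")

def has_correlation_trap(W: np.ndarray, tol: float = 1.05, seed: int = 0) -> bool:
    """Element-wise shuffle W and flag an outlier singular value that escapes
    the Marchenko-Pastur-style bulk edge of an i.i.d. matrix with the same
    element variance (a simplified stand-in for the paper's KS-test check)."""
    rng = np.random.default_rng(seed)
    W_rand = rng.permutation(W.ravel()).reshape(W.shape)
    svals = np.linalg.svd(W_rand, compute_uv=False)
    n, m = W.shape
    bulk_edge = W_rand.std() * (np.sqrt(n) + np.sqrt(m))
    return bool(svals.max() > tol * bulk_edge)

for name, module in model.named_modules():
    if isinstance(module, nn.Linear):
        W = module.weight.detach().cpu().numpy()
        print(f"{name}: correlation trap = {has_correlation_trap(W)}")
```

In the paper itself, correlation traps are confirmed by visual inspection of the empirical spectral densities and by Kolmogorov–Smirnov tests on the randomized spectra, rather than by the single bulk-edge comparison used in this sketch.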