Keywords: Grokking, Heavy-Tailed Self-Regularization, Random Matrix Theory, Heavy-Tail Exponent, Spectral Analysis, Generalization Dynamics, Catastrophic Generalization Collapse, Implicit Regularization
TL;DR: Discovering a new phase of grokking after 10 million steps through a novel application of HTSR and RMT
Abstract: We study the well-known grokking phenomenon in neural networks (NNs) using a 3-layer MLP trained on a 1k-sample subset of MNIST, with and without weight decay, and discover a novel third phase, **anti-grokking**, that occurs very late in training: test accuracy collapses while training accuracy stays perfect.
This late-stage collapse resembles, but is distinct from, the known **pre-grokking** and **grokking** phases, and is not detected by other proposed grokking progress measures.
Leveraging Heavy-Tailed Self-Regularization (**HTSR**) through the open-source WeightWatcher tool, we show that the **HTSR** layer quality metric $\alpha$ alone delineates **all three** phases, whereas the best competing metrics detect only the first two.
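As an illustration, the per-layer $\alpha$ can be read off directly from WeightWatcher. The following is a minimal sketch, assuming a PyTorch model and the pip-installable weightwatcher package; the MLP dimensions here are illustrative stand-ins, not the paper's exact architecture:

```python
# Minimal sketch: per-layer HTSR alpha via the open-source WeightWatcher tool.
# Assumes `pip install weightwatcher torch`; layer sizes are illustrative.
import weightwatcher as ww
import torch.nn as nn

# Hypothetical 3-layer MLP standing in for the model trained in the paper.
model = nn.Sequential(
    nn.Linear(784, 256), nn.ReLU(),
    nn.Linear(256, 256), nn.ReLU(),
    nn.Linear(256, 10),
)

watcher = ww.WeightWatcher(model=model)
details = watcher.analyze()            # per-layer DataFrame of spectral metrics
print(details[["layer_id", "alpha"]])  # HTSR layer quality metric alpha

# Per the paper's claim, alpha < 2 in a layer would flag the anti-grokking regime.
```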
Anti-grokking is revealed only after training for $10^{7}$ steps and is invariably heralded by $\alpha<2$ and the appearance of **Correlation Traps**: outlier singular values in the randomized layer weight matrices that make the layer weight matrix **atypical** and signal overfitting of the training set.
Such traps are verified both by visual inspection of the layer-wise empirical spectral densities and by Kolmogorov–Smirnov tests on the randomized spectra.
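To make the randomization check concrete, here is a minimal sketch of one plausible version of it, assuming element-wise shuffling as the randomization and a two-sample KS test between empirical spectral densities; the random stand-in matrix, the 0.99-quantile bulk edge, and the 1.5x trap threshold are illustrative assumptions, not the paper's exact procedure:

```python
# Sketch: KS test of a layer's empirical spectral density (ESD) against
# its element-wise-shuffled counterpart, plus a simple correlation-trap check.
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)

def esd(W):
    """Empirical spectral density: eigenvalues of W^T W / N (N = rows)."""
    sv = np.linalg.svd(W, compute_uv=False)
    return sv**2 / W.shape[0]

W = rng.standard_normal((256, 256))                   # stand-in for a trained layer matrix
W_rand = rng.permutation(W.ravel()).reshape(W.shape)  # shuffling destroys correlations

lam, lam_rand = esd(W), esd(W_rand)

# KS test: a small p-value flags the layer spectrum as atypical
# relative to its randomized counterpart.
stat, pval = ks_2samp(lam, lam_rand)
print(f"KS stat = {stat:.3f}, p = {pval:.3g}")

# Correlation-trap heuristic: an outlier eigenvalue in the *randomized*
# spectrum sitting well beyond its bulk edge (threshold is illustrative).
bulk_edge = np.quantile(lam_rand, 0.99)
traps = lam_rand[lam_rand > 1.5 * bulk_edge]
print("possible correlation traps:", traps)
```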
Comparative metrics, including activation sparsity, absolute weight entropy, circuit complexity, and $\ell^{2}$ weight norms, track pre-grokking and grokking but fail to distinguish grokking from anti-grokking.
This discovery provides a way to measure overfitting and generalization collapse without direct access to the test data.
These results strengthen the claim that the **HTSR** $\alpha$ provides a universal layer-convergence target at $\alpha\approx 2$ and underscore the value of the **HTSR** $\alpha$ metric as a measure of generalization.
Student Paper: Yes
Submission Number: 71