Grokking and Generalization Collapse: Insights from HTSR Theory

12 May 2025 (modified: 29 Oct 2025) · Submitted to NeurIPS 2025 · CC BY 4.0
Keywords: Grokking, Heavy-Tailed Self-Regularization, Random Matrix Theory, Heavy-Tail Exponent, Spectral Analysis, Generalization Dynamics, Catastrophic Generalization Collapse, Implicit Regularization
TL;DR: By studying grokking, we show how to detect whether a model is overfit or merely poorly trained, without access to either test or training data.
Abstract: Grokking is a surprising phenomenon in neural network training in which test accuracy remains low for an extended period despite near-perfect training accuracy, only to suddenly leap to strong generalization. In this work, we study grokking using a depth-3, width-200 ReLU MLP trained on a subset of MNIST. We investigate its long-term dynamics under both weight-decay and, critically, no-decay regimes—the latter often characterized by increasing $l^2$ weight norms. Our primary tool is the theory of Heavy-Tailed Self-Regularization (**HTSR**), in which we track the heavy-tailed exponent $\alpha$. We find that $\alpha$ reliably predicts both the initial grokking transition and the subsequent anti-grokking. We benchmark these insights against four prior approaches: three progress measures---Activation Sparsity, Absolute Weight Entropy, and Approximate Local Circuit Complexity---and weight-norm ($l^2$) analysis. Our experiments show that while the comparative approaches register significant changes, **in this regime of increasing $l^2$ norm, the heavy-tailed exponent $\alpha$ demonstrates a unique correlation with the ensuing large, long-term dip in test accuracy, a signal not reliably captured by most other measures.** Extending our zero-weight-decay experiment significantly beyond typical timescales ($10^{5}$ to approximately $10^{7}$ optimization steps), **we reveal a late-stage catastrophic generalization collapse (``anti-grokking''), characterized by a dramatic drop in test accuracy (over 25 percentage points) while training accuracy remains perfect**; notably, the heavy-tail metric $\alpha$ uniquely provides an early warning of this impending collapse. Our results underscore the utility of Heavy-Tailed Self-Regularization theory for tracking generalization dynamics, even in challenging regimes without explicit weight-decay regularization.
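The abstract's core diagnostic is the layer-wise heavy-tail exponent $\alpha$ fit to the eigenvalue spectrum of a weight matrix. Below is a minimal sketch of how such an estimate can be computed; it is not the paper's released code. The fixed-quantile tail cutoff (`tail_frac`) is a hypothetical simplification, and HTSR analyses in practice often use the `weightwatcher` package, which selects the cutoff more carefully.

```python
# Minimal sketch: estimating the HTSR heavy-tail exponent alpha for one layer.
# Assumptions (not from the paper): a continuous power-law MLE (Clauset et al., 2009)
# over the top tail of the spectrum of W^T W, with xmin at a fixed quantile.
import numpy as np

def heavy_tail_alpha(W: np.ndarray, tail_frac: float = 0.5) -> float:
    """Estimate the power-law exponent alpha of the empirical spectral
    density of W^T W (squared singular values of W)."""
    evals = np.sort(np.linalg.svd(W, compute_uv=False) ** 2)
    # Fit only the upper tail; tail_frac is an illustrative knob. A proper
    # xmin selection would minimize the Kolmogorov-Smirnov distance instead.
    xmin = evals[int(len(evals) * (1.0 - tail_frac))]
    tail = evals[evals >= xmin]
    # Continuous power-law MLE: alpha = 1 + n / sum(log(x / xmin)).
    return 1.0 + len(tail) / np.sum(np.log(tail / xmin))

# Example: a random Gaussian layer, where no heavy tail is expected,
# so the fitted alpha comes out large (well above the HTSR range of ~2-6).
rng = np.random.default_rng(0)
W = rng.normal(size=(200, 200))
print(f"alpha ~ {heavy_tail_alpha(W):.2f}")
```

Tracked over training checkpoints, a falling $\alpha$ would signal increasing spectral heavy-tailedness; the paper reports that this trajectory anticipates both grokking and the later anti-grokking collapse.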
Supplementary Material: zip
Primary Area: Deep learning (e.g., architectures, generative models, optimization for deep networks, foundation models, LLMs)
Submission Number: 26909