Grokking Beyond the Euclidean Norm of Model Parameters

Published: 01 May 2025, Last Modified: 18 Jun 2025 · ICML 2025 poster · CC BY 4.0
TL;DR: Grokking, the sudden generalization that follows prolonged overfitting, can be triggered by alternative regularizers such as the $\ell_1$ or nuclear norm, or by depth-induced implicit biases, without relying on the $\ell_2$ norm.
Abstract: Grokking refers to delayed generalization following overfitting when optimizing artificial neural networks with gradient-based methods. In this work, we demonstrate that grokking can be induced by regularization, either explicit or implicit. More precisely, we show that when there exists a model with a property $P$ (e.g., sparse or low-rank weights) that generalizes on the problem of interest, gradient descent with a small but non-zero regularization promoting $P$ (e.g., $\ell_1$ or nuclear norm regularization) results in grokking. This extends previous work showing that small non-zero weight decay induces grokking. Moreover, our analysis shows that over-parameterization by adding depth makes it possible to grok or ungrok without explicit regularization, which is impossible in shallow models. We further show that the $\ell_2$ norm is not a reliable proxy for generalization when the model is regularized toward a different property $P$: in many cases without weight decay, the $\ell_2$ norm grows while the model generalizes anyway. Finally, we show that grokking can be amplified solely through data selection, with all other hyperparameters fixed.
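The recipe stated in the abstract, adding a small but non-zero penalty that promotes the target property $P$ to the task loss, is simple to state in code. Below is a minimal PyTorch sketch, assuming a toy two-layer MLP on a modular-addition task (a classic grokking benchmark); the hyperparameters, the model, and the `sample_training_batch` helper are illustrative placeholders rather than the authors' setup, which lives in the linked repository.

```python
# Minimal sketch (not the authors' exact code): inducing grokking with a small
# l1 or nuclear-norm penalty. All names and hyperparameters are illustrative.
import torch
import torch.nn as nn

p = 97  # modular addition: predict (a + b) mod p from one-hot inputs
model = nn.Sequential(nn.Linear(2 * p, 256), nn.ReLU(), nn.Linear(256, p))
opt = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=0.0)

def penalty(model, kind="l1"):
    """Regularizer promoting the property P (sparsity or low rank)."""
    if kind == "l1":  # promotes sparse weights
        return sum(w.abs().sum() for w in model.parameters())
    # "nuclear": promotes low-rank weight matrices
    return sum(torch.linalg.matrix_norm(w, ord="nuc")
               for w in model.parameters() if w.ndim == 2)

beta = 1e-4  # small but non-zero; this is what triggers grokking
ce = nn.CrossEntropyLoss()

for step in range(100_000):  # grokking needs long training past overfitting
    x, y = sample_training_batch()  # hypothetical data loader
    loss = ce(model(x), y) + beta * penalty(model, kind="l1")
    opt.zero_grad()
    loss.backward()
    opt.step()
```

With `beta = 0` the model typically just memorizes the training set; the small non-zero penalty is what steers gradient descent toward the sparse (or low-rank) generalizing solution described in the abstract, at the cost of a long delay before test accuracy jumps.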
Lay Summary: Sometimes when learning, children do not seem to understand something at first — they simply mimic what they see. But after enough repetition, something clicks: they suddenly "get it" and can apply the idea in new situations. The same thing can happen with artificial intelligence (AI). AI models often start by memorizing the training examples. Yet, after a surprisingly long time, they begin to understand the underlying patterns and solve problems they have never seen before. This sudden shift is called **grokking**. Our research investigates why grokking happens and how to influence it. We find that it is not just about the model's architecture — grokking also depends on the kind of simplicity (**regularization**) enforced during training, such as using fewer connections (**sparsity**) or a simpler internal structure (**low-rankness**). In some cases, we even show that grokking is necessary for a model to reach an optimal solution. However, enforcing simplicity comes with a tradeoff: a small amount of regularization can improve generalization, but it requires significantly more training time. Our results provide a way to manage this tradeoff based on the available resources and the kind of behavior we want from the model. These insights help explain why some AI systems require much more training than expected to reach deep understanding — and how we can guide them more effectively.
Link To Code: https://github.com/Tikquuss/grokking_beyong_l2_norm
Primary Area: Theory->Deep Learning
Keywords: Grokking, Delayed Generalization, Regularization, Sparsity, Low-Rank, Overparameterization, Gradient Descent, Implicit Regularization
Submission Number: 5603