When Data Falls Short: Grokking Below the Critical Threshold

Published: 23 Sept 2025, Last Modified: 11 Nov 2025 · CCFM Poster · CC BY 4.0
Keywords: Grokking, Continual Learning, Knowledge Distillation
TL;DR: Investigating grokking phenomena below the critical data regime
Abstract: In this paper, we investigate the phenomenon of grokking, where models exhibit delayed generalization following overfitting on training data. We focus on data-scarce regimes where the number of training samples falls below the critical threshold, making grokking unobservable, and on practical scenarios involving distribution shift. We first show that Knowledge Distillation (KD) from a model that has already grokked on a distribution ($p_1$) can induce and accelerate grokking on a different distribution ($p_2$), even when the available data lies below the critical threshold. This highlights the value of KD for deployed models that must adapt to new distributions under limited data. We then study training on the joint distribution of $p_1$ and $p_2$ and demonstrate that while standard supervised training fails when either distribution has insufficient data, distilling from models grokked on the individual distributions enables generalization. Finally, we examine a continual pretraining setup, where a grokked model transitions from $p_1$ to $p_2$, and find that KD both accelerates generalization and mitigates catastrophic forgetting, achieving strong performance even with only 10\% of the data. Together, our results provide new insights into the mechanics of grokking under knowledge transfer and underscore the central role of KD in enabling generalization in low-data and evolving distribution settings.
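To make the distillation setup concrete, the sketch below shows the standard Hinton-style KD objective that the abstract's setting implies: a student trained on scarce $p_2$ data while matching the soft targets of a teacher that has already grokked on $p_1$. This is an illustrative sketch, not the authors' implementation; the function names, temperature, and mixing weight are assumptions.

```python
# Minimal sketch (assumption: not the paper's code) of distilling from a
# teacher grokked on p1 into a student trained on a small p2 dataset.
import torch
import torch.nn.functional as F

def kd_loss(student_logits, teacher_logits, targets, T=2.0, alpha=0.5):
    """Hard-label cross-entropy blended with a temperature-scaled KL term.

    T (temperature) and alpha (mixing weight) are illustrative values,
    not taken from the paper.
    """
    ce = F.cross_entropy(student_logits, targets)
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)  # rescale so soft-target gradients match the CE scale
    return alpha * ce + (1.0 - alpha) * soft

# Usage sketch: the grokked teacher is frozen; the student sees only p2 data.
# for x, y in p2_loader:
#     with torch.no_grad():
#         t_logits = teacher_grokked_p1(x)
#     loss = kd_loss(student(x), t_logits, y)
#     loss.backward(); optimizer.step(); optimizer.zero_grad()
```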
Serve As Reviewer: ~Vaibhav_Singh1
Submission Number: 19