Delayed Generalization: Bridging Double Descent and Grokking

23 Sept 2023 (modified: 25 Mar 2024) · ICLR 2024 Conference Withdrawn Submission
Primary Area: unsupervised, self-supervised, semi-supervised, and supervised representation learning
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Keywords: double descent, grokking, science of deep learning, empirical theory of deep learning, generalization, overfitting, delayed generalization, feature learning, pattern learning, representation learning
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2024/AuthorGuide.
TL;DR: We argue that grokking and double descent are better understood as instances of a broader phenomenon that we call \emph{Staggered Learning}.
Abstract: A popular approach to understanding generalization in neural networks is to study phenomena such as double descent and grokking, in which learning curves exhibit non-monotonicity and generalization occurs long after overfitting. So far, these topics have been studied in isolation. We unify double descent and grokking by showing important similarities between them; in particular, we provide the first demonstrations of grokking with respect to model size, regularization strength, and sample size, and show that it is possible to empirically interpolate between grokking and double descent in various settings. We argue that grokking and double descent are better understood as instances of a broader phenomenon that we call \emph{Staggered Learning}, and we introduce a model of delayed generalization in terms of pattern learning speeds.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors' identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 8387