Keywords: Grokking, Logistic Regression, Interpolation Threshold, Memorization and Learning
TL;DR: We show that binary logistic regression on a random feature model exhibits Grokking when trained on data that is nearly linearly separable: the network first overfits and only later transitions to the generalizing solution.
Abstract: We study the generalization properties of binary logistic classification in a simplified setting, for which a "memorizing" and "generalizing" solution can always be strictly defined, and elucidate empirically and analytically the mechanism underlying Grokking in its dynamics.
Analyzing the final stages of training of a logistic classifier on Gaussian data with a constant label, we show that it may exhibit Grokking, in the sense of delayed generalization and non-monotonic test loss, when the parameters of the problem are close to a critical point.
Specifically, we find that Grokking is amplified when the training set is on the verge of linear separability from the origin. Even though a perfect generalizing solution always exists, the implicit bias of the logistic loss causes the model to overfit whenever the training data is linearly separable from the origin.
For training sets that are not separable from the origin, the model will always generalize perfectly in infinite time, but overfitting may occur at early stages of training.
Importantly, in the vicinity of the transition, that is, for training sets that are almost separable from the origin, the model may overfit for an arbitrarily long time before generalizing.
We gain more insights by examining a tractable one-dimensional toy model that quantitatively captures the key features of the full model.
Finally, we highlight intriguing commonalities between our findings and the recent literature, suggesting that Grokking generally occurs in proximity to the interpolation threshold, reminiscent of critical phenomena often observed in physical systems.
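To make the setting concrete, here is a minimal sketch, under assumed parameter values (dimension, sample size, class mean, learning rate, and step count are illustrative choices, not the authors' exact experiment), of the scenario described in the abstract: full-batch gradient descent on the logistic loss for Gaussian data with a constant +1 label, tracking train and test loss over long training.

```python
# Illustrative sketch only (not the authors' exact experiment): full-batch gradient
# descent on the logistic loss for Gaussian data with a constant (+1) label.
# The dimension, sample size, class mean, learning rate, and step count are assumptions;
# they may need tuning (e.g., sweeping n_train or the norm of mu) to place the training
# set near the separability-from-the-origin threshold, where the test loss can stay
# high (or rise) for a long time before eventually decreasing.
import numpy as np
from scipy.special import expit  # numerically stable sigmoid

rng = np.random.default_rng(0)

d, n_train, n_test = 200, 220, 2000          # assumed sizes, with n_train of order d
mu = np.ones(d) / np.sqrt(d)                 # assumed class mean: the "generalizing" direction
X_train = mu + rng.standard_normal((n_train, d))
X_test = mu + rng.standard_normal((n_test, d))
# With all labels equal to +1, the logistic loss reduces to mean(log(1 + exp(-X @ w))).

def logistic_loss(X, w):
    return np.mean(np.logaddexp(0.0, -X @ w))

w = np.zeros(d)
lr = 0.2
for step in range(100_001):
    p = expit(-(X_train @ w))                        # sigmoid of the negative margin
    grad = -(X_train * p[:, None]).mean(axis=0)      # gradient of the mean logistic loss
    w -= lr * grad
    if step % 10_000 == 0:
        print(f"step {step:6d}  train loss {logistic_loss(X_train, w):.4f}  "
              f"test loss {logistic_loss(X_test, w):.4f}")
```

Sweeping n_train (or the norm of mu) moves the training set across the separability threshold; per the abstract, the delay before the test loss drops is expected to be longest for training sets that are almost, but not quite, separable from the origin.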
Primary Area: learning theory
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 7948