TL;DR: We demonstrate that simple binary logistic regression can exhibit grokking when trained on nearly linearly separable data, initially overfitting before transitioning to a generalizing solution.
Abstract: We investigate the phenomenon of grokking -- delayed generalization accompanied by non-monotonic test loss behavior -- in a simple binary logistic classification task, for which "memorizing" and "generalizing" solutions can be strictly defined.
Surprisingly, we find that grokking arises naturally even in this minimal model when the parameters of the problem are close to a critical point, and provide both empirical and analytical insights into its mechanism.
Concretely, by appealing to the implicit bias of gradient descent, we show that logistic regression can exhibit grokking when the training dataset is nearly linearly separable from the origin and there is strong noise in the perpendicular directions.
The underlying reason is that near the critical point, "flat" directions in the loss landscape with nearly zero gradient cause training dynamics to linger for arbitrarily long times near quasi-stable solutions before eventually reaching the global minimum.
Finally, we highlight similarities between our findings and the recent literature, strengthening the conjecture that grokking generally occurs in proximity to the interpolation threshold, reminiscent of critical phenomena often observed in physical systems.
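To make the setting concrete, below is a minimal sketch in the spirit of the abstract (an illustrative toy, not the authors' exact experiment): a dataset whose labels follow one weak signal coordinate, with strong label-independent noise in the perpendicular coordinates, fit by full-batch gradient descent on the logistic loss without a bias term (so "separable from the origin" is the relevant notion). All names (make_data, log_loss) and parameter values (signal, noise, D_NOISE, learning rate) are assumptions chosen so the training sample sits near the linear-separability threshold; whether a given seed overfits indefinitely, groks late, or generalizes immediately depends on how close that sample is to the threshold.

```python
import numpy as np
from scipy.special import expit  # numerically stable sigmoid

rng = np.random.default_rng(0)

D_NOISE = 31   # noise dimensions; n_train ~ 2*(D_NOISE+1) puts the sample
N_TRAIN = 64   # near the linear-separability (interpolation) threshold

def make_data(n, signal=0.3, noise=3.0):
    """Toy data: coordinate 0 carries a weak label signal, the remaining
    coordinates are label-independent high-variance noise (assumed setup)."""
    y = rng.choice([-1.0, 1.0], size=n)
    x0 = y * signal + rng.normal(0.0, 1.0, size=n)    # weak, noisy signal
    xz = rng.normal(0.0, noise, size=(n, D_NOISE))    # strong perpendicular noise
    return np.column_stack([x0, xz]), y

def log_loss(X, y, w):
    # mean log(1 + exp(-y * <x, w>)), computed stably
    return np.logaddexp(0.0, -y * (X @ w)).mean()

X_tr, y_tr = make_data(N_TRAIN)
X_te, y_te = make_data(10_000)

w = np.zeros(1 + D_NOISE)   # logistic regression with no bias term
lr = 0.5

for step in range(1, 500_001):
    margins = y_tr * (X_tr @ w)
    # gradient of the mean logistic loss: -mean(y * sigmoid(-margin) * x)
    grad = -(X_tr * (y_tr * expit(-margins))[:, None]).mean(axis=0)
    w -= lr * grad
    if step % 50_000 == 0:
        print(f"step {step:7d}  train {log_loss(X_tr, y_tr, w):.4f}  "
              f"test {log_loss(X_te, y_te, w):.4f}")
```

Near the threshold, such a run can show the non-monotonic signature the abstract describes: the training loss falls early while the test loss rises (the weight vector exploits the noise directions to fit the sample), and only much later does the weight direction rotate toward the signal coordinate and the test loss drop.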
Lay Summary: We investigate why a strange training pattern called “grokking” occurs in very simple machine learning models. Grokking happens when a model at first manages to fit the training data perfectly but fails to generalize (i.e., doesn’t perform well on new data). After more training, it suddenly figures out how to generalize properly.
We discovered that grokking happens in simple yes/no (binary) classification tasks when the training data is almost perfectly separated by a straight line (hyperplane), but not quite. This causes the model to get stuck for a long time before finally learning to generalize. We believe this is similar to how physical systems behave near phase transitions.
Primary Area: Theory->Learning Theory
Keywords: Grokking, Logistic Regression, Interpolation Threshold, Memorization and Learning
Submission Number: 7737