Abstract: Building a principled understanding of generalization in deep learning requires unifying disparate observations under a single conceptual framework. Previous work has studied grokking, a training dynamic in which a sustained period of near-perfect training performance and near-chance test performance is eventually followed by generalization, as well as the superficially similar double descent. These topics have so far been studied in isolation. We hypothesize that grokking and double descent can be understood as instances of the same learning dynamics within a framework of pattern learning speeds, and that this framework also applies when varying model capacity instead of optimization steps. We confirm some implications of this hypothesis empirically, including demonstrating model-wise grokking.
Community Implementations: [![CatalyzeX](/images/catalyzex_icon.svg) 1 code implementation](https://www.catalyzex.com/paper/arxiv:2303.06173/code)