Grokking, Rank Minimization and Generalization in Deep Learning

Published: 24 Jun 2024, Last Modified: 31 Jul 2024 · ICML 2024 MI Workshop Poster · CC BY 4.0
Keywords: Grokking, low rank, neural networks, generalization
TL;DR: We find that grokking coincides with the emergence of low-rank weight matrices, and show that neural networks tend toward rank minimization in more general settings, with weight decay controlling this trend.
Abstract: Much work has been devoted to explaining the recently discovered "grokking" phenomenon, where a neural network first fits the training loss, then many iterations later suddenly fits the validation loss. To explore this puzzling behavior, we examine the evolution of singular values and vectors of weight matrices inside the neural network. First, we show that the transition to generalization in grokking coincides with the discovery of a low-rank solution in the weights. We then show that the trend towards rank minimization is much more general than grokking alone and elucidate the crucial role that weight decay plays in promoting this trend. Such analysis leads to a deeper understanding of generalization in practical systems.
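As an illustration of the kind of analysis the abstract describes (tracking singular values of weight matrices during training), here is a minimal Python/PyTorch sketch. It is not the authors' exact procedure: the effective-rank definition via a relative singular-value threshold, the logging interval, and the training-loop names (`model`, `loader`, `optimizer`, `loss_fn`) are all illustrative assumptions.

```python
import torch
import torch.nn as nn


def effective_rank(weight: torch.Tensor, rel_threshold: float = 1e-3) -> int:
    """Count singular values above rel_threshold * (largest singular value).

    A simple proxy for the rank of a weight matrix; the threshold choice
    is an illustrative assumption, not taken from the paper.
    """
    with torch.no_grad():
        s = torch.linalg.svdvals(weight)  # singular values, descending order
    return int((s > rel_threshold * s[0]).sum().item())


def log_weight_ranks(model: nn.Module, step: int) -> None:
    """Print the effective rank of every 2-D weight matrix in the model."""
    for name, param in model.named_parameters():
        if param.ndim == 2:  # skip biases and other non-matrix parameters
            print(f"step {step:6d} | {name:30s} | effective rank {effective_rank(param)}")


# Hypothetical usage inside a training loop (model, loader, optimizer assumed):
# for step, (x, y) in enumerate(loader):
#     loss = loss_fn(model(x), y)
#     optimizer.zero_grad()
#     loss.backward()
#     optimizer.step()
#     if step % 1000 == 0:
#         log_weight_ranks(model, step)
```

Logging such a rank measure alongside training and validation loss is one way to observe whether a drop in effective rank lines up with the delayed generalization characteristic of grokking.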
Submission Number: 102