Grokking as Simplification: A Nonlinear Complexity Perspective

Published: 02 Nov 2023, Last Modified: 18 Dec 2023UniReps PosterEveryoneRevisionsBibTeX
Keywords: Grokking, network complexity, geometry of representations, information, compression
TL;DR: We explain grokking as simplification from a nonlinear complexity perspective.
Abstract: We attribute grokking, the phenomenon where generalization is much delayed after memorization, to compression. We define linear mapping number (LMN) to measure network complexity, which is a generalized version of linear region number for ReLU networks. LMN can nicely characterize neural network compression before generalization. Although $L_2$ norm has been popular to characterize model complexity, we argue in favor of LMN for a number of reasons: (1) LMN can be naturally interpreted as information/computation, while $L_2$ cannot. (2) In the compression phase, LMN has nice linear relations with test losses, while $L_2$ is correlated with test losses in a complicated nonlinear way. (3) LMN also reveals an intriguing phenomenon of the XOR network switching between two generalization solutions, while $L_2$ does not. Besides explaning grokking, we argue that LMN is a promising candidate as the neural network version of the Kolmogorov complexity, since it explicitly considers local or conditioned linear computations aligned with the nature of modern artificial neural networks.
Track: Extended Abstract Track
Submission Number: 8