Abstract: Grokking typically achieves losses similar to those of ordinary, ‘steady’ learning. This work asks whether these
different learning paths lead to fundamental differences in the learned models. To do so, we compare the
features, compressibility, and learning dynamics of models trained via each path in two controlled toy
tasks. We find that grokked and steadily trained models learn the same features, but there can be
large differences in the efficiency with which these features are encoded. In particular, we identify a novel
‘compressive regime’ of steady training, in which a linear trade-off emerges between model loss and
compressibility; this trade-off is absent in grokking. In this regime, compression factors of 25x can be
realised for the steadily trained model, versus 5x for the grokked model. Model features
and compressibility are then tracked through training. We show that model development in grokking
is task-dependent, and that peak compressibility is achieved immediately after the grokking plateau.
Finally, we introduce novel information-geometric measures which demonstrate that models undergoing
grokking follow a straight path in information space.