Abstract: Grokking typically achieves losses similar to those of ordinary, ‘steady’ learning. This work asks whether these
different learning paths lead to fundamental differences in the learned models. To do so, we compare the
features, compressibility, and learning dynamics of models trained via each path in two controlled toy
tasks. We find that grokked and steadily trained models learn the same features, but there can be
large differences in the efficiency with which these features are encoded. In particular, we identify a novel
‘compressive regime’ of steady training, in which a linear trade-off emerges between model loss and
compressibility; this trade-off is absent in grokking. In this regime, compression factors of 25x can be
realised for the steadily trained model, versus 5x for the grokked model. Model features
and compressibility are then tracked through training. We show that model development in grokking
is task-dependent, and that peak compressibility is achieved immediately after the grokking plateau.
Finally, we introduce novel information-geometric measures which demonstrate that models undergoing
grokking follow a straight path in information space.