Keywords: grokking, entropy, generalization, glass
TL;DR: By framing grokking as computational glass relaxation, this work explains grokking from the perspective of Boltzmann entropy and proposes a physics-based grokking-resistant optimizer.
Abstract: Understanding neural networks' (NNs) generalizability remains a central question in deep learning research.
The special phenomenon of grokking, where NNs abruptly generalize long after training performance reaches a near-perfect level, offers a unique window into the underlying mechanisms of NNs' generalizability.
Here we propose an interpretation of grokking by framing it as computational glass relaxation: viewing NNs as a physical system whose parameters are the degrees of freedom and whose training loss is the system energy, we find that memorization resembles the rapid cooling of a liquid into a non-equilibrium glassy state at low temperature, while the later generalization resembles a slow relaxation towards a more stable configuration.
This mapping enables us to sample NNs' Boltzmann entropy (density of states) landscape as a function of training loss and test accuracy.
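To make the sampled quantity concrete, it can be written as a Boltzmann entropy over the (training loss, test accuracy) plane; the notation below is an illustrative sketch with assumed symbols, not the paper's own definitions:

```latex
% Illustrative notation only (symbols are assumptions, not taken from the paper):
% \Omega(L, A) is the density of parameter configurations attaining training
% loss L and test accuracy A; S is the corresponding Boltzmann entropy.
S(L, A) = k_B \ln \Omega(L, A)
```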
Our experiments with transformers on arithmetic tasks suggest that there is NO entropy barrier in the memorization-to-generalization transition of grokking, challenging previous theory that defines grokking as a first-order phase transition.
We identify a high-entropy advantage under grokking, extending prior work linking entropy to generalizability but in a much more pronounced form.
Inspired by grokking's far-from-equilibrium nature, we develop a toy optimizer, WanD, based on Wang-Landau molecular dynamics, which can eliminate grokking without any constraints and finds high-norm generalizing solutions.
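As a rough illustration of the flat-histogram idea behind Wang-Landau-style dynamics, a gradient step can be biased by a running estimate of the log-histogram of visited loss levels, so that over-visited (memorizing) levels are penalized. The sketch below is hypothetical: it is not the authors' WanD optimizer, and the toy quadratic loss and all hyperparameters are illustrative assumptions only.

```python
# A loose, hypothetical sketch of a flat-histogram-biased gradient step,
# in the spirit of Wang-Landau / metadynamics-style sampling.
# NOT the authors' WanD optimizer; the toy loss and settings are assumptions.
import numpy as np

rng = np.random.default_rng(0)

def loss(w):
    # Toy "energy": a quadratic bowl standing in for the training loss.
    return 0.5 * float(np.sum(w ** 2))

def grad(w):
    return w

# Discretize the loss axis into bins and accumulate a log-histogram ln_g;
# its slope biases the dynamics away from over-visited loss levels.
n_bins, e_max = 50, 5.0
bin_width = e_max / n_bins
ln_g = np.zeros(n_bins)
f = 0.05                      # histogram increment, slowly annealed
lr, noise = 1e-2, 1e-2

w = rng.normal(size=10)
for step in range(20_000):
    e = loss(w)
    b = min(int(e / bin_width), n_bins - 1)
    b_hi = min(b + 1, n_bins - 1)
    # Slope of ln_g along the loss axis; by the chain rule, biasing the loss
    # coordinate rescales the parameter-space gradient by (1 + slope).
    slope = (ln_g[b_hi] - ln_g[b]) / bin_width
    w = w - lr * (1.0 + slope) * grad(w) + noise * rng.normal(size=w.shape)
    ln_g[b] += f              # penalize the currently visited loss level
    f *= 0.9999               # anneal the increment over time
```

The bias term pushes the dynamics out of over-visited low-loss regions rather than letting them equilibrate there, which is one plausible reading of the far-from-equilibrium exploration the abstract credits to WanD.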
This provides strictly defined counterexamples to theories attributing grokking solely to weight-norm evolution towards the Goldilocks zone, and also suggests new potential directions for optimizer design.
Supplementary Material: zip
Primary Area: Theory (e.g., control theory, learning theory, algorithmic game theory)
Submission Number: 16796