Keywords: grokking, scaling laws, phase transitions, compute-optimal training, learning dynamics
TL;DR: We derive compute-optimal scaling laws for grokking onset from a 384-configuration sweep, showing that dataset fraction dominates grokking time and that wider models grok in fewer steps but at higher FLOP cost.
Abstract: We derive compute-optimal scaling laws for the generalization phase transition known as grokking. A 384-configuration sweep of two-layer MLPs on modular arithmetic yields $T_{\mathrm{grok}} \propto H^{-0.27} D^{-2.04} \eta^{-0.50} \lambda^{-0.64}$ ($R^2 = 0.73$; $0.82$ with interactions), where $H$ is width (its exponent is fit from three width levels), $D$ is dataset fraction, $\eta$ is learning rate, and $\lambda$ is weight decay. Dataset fraction dominates: doubling $D$ cuts grokking time by roughly 4x ($2^{2.04} \approx 4.1$). A phase diagram in $(\eta, \lambda)$ space reveals a sharp boundary separating grokking from non-grokking regimes. Compute-optimal analysis shows that wider models grok in fewer steps but at higher FLOP cost, mirroring Chinchilla-style trade-offs. The law predicts generalization onset from hyperparameters alone, before training begins, enabling practitioners to set compute budgets without pilot runs.
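As a minimal sketch of how the fitted law could be used, the snippet below compares two configurations by their predicted grokking-time ratio. The abstract reports only the exponents, not the prefactor, so absolute step counts cannot be recovered here; the function name `grok_time_ratio` and the example hyperparameter values are hypothetical and chosen purely for illustration.

```python
# Exponents as reported in the abstract. The proportionality constant is
# not given, so only ratios between two configurations are meaningful.
EXPONENTS = {"H": -0.27, "D": -2.04, "eta": -0.50, "lam": -0.64}


def grok_time_ratio(base, new):
    """Return T_grok(new) / T_grok(base) under the main-effects power law."""
    ratio = 1.0
    for name, exponent in EXPONENTS.items():
        ratio *= (new[name] / base[name]) ** exponent
    return ratio


# Hypothetical configuration (values are assumptions, not from the paper).
base = {"H": 128, "D": 0.25, "eta": 1e-3, "lam": 1e-2}

# Doubling the dataset fraction while holding everything else fixed.
doubled_d = dict(base, D=0.5)
speedup = 1.0 / grok_time_ratio(base, doubled_d)
print(f"Predicted speedup from doubling D: {speedup:.2f}x")
```

This uses only the main-effects fit ($R^2 = 0.73$); the interaction terms behind the $0.82$ figure are not specified in the abstract and are therefore omitted.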
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 66