One puzzling artifact in machine learning, dubbed grokking, refers to the case where a model exhibits delayed generalization: test performance improves suddenly only after many training iterations beyond the point of near-perfect overfitting. Focusing on this long delay from the perspective of machine learning practitioners, our primary goal is to accelerate the generalization of a model under the grokking phenomenon. By regarding the sequence of gradients of a parameter over training iterations as a random signal over time, we can spectrally decompose the parameter trajectories under gradient descent into two components: a fast-varying, overfitting-yielding component and a slow-varying, generalization-inducing component. This analysis allows us to accelerate grokking by more than $\times 50$ with only a few lines of code that amplify the slow-varying components of the gradients. Our experiments show that the algorithm applies to diverse tasks involving images, languages, and graphs, making this peculiar artifact of sudden generalization practically available. Moreover, we reinterpret the momentum hyperparameters of gradient-based optimizers as low-pass filters with size-1 windows. This bridges optimization and classical signal processing literature, suggesting a new type of optimizer augmented with frequency-domain filters.
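To make the "few lines of code" concrete, the following is a minimal sketch of the idea of amplifying the slow-varying gradient component. It is not the authors' released implementation; the function name `filtered_grad`, the exponential-moving-average low-pass filter, and the hyperparameter values (`alpha`, `lam`) are illustrative assumptions, demonstrated on a toy quadratic objective rather than a grokking task.

```python
import numpy as np

def filtered_grad(grad, ema, alpha=0.98, lam=2.0):
    """Hypothetical slow-gradient amplification step.

    ema   -- exponential moving average of past gradients; acts as a
             low-pass filter extracting the slow-varying component.
    lam   -- how strongly the slow component is amplified.
    Returns the modified gradient and the updated filter state.
    """
    ema = alpha * ema + (1.0 - alpha) * grad  # low-pass filter update
    return grad + lam * ema, ema              # amplify slow component

# Toy demonstration: gradient descent on f(w) = 0.5 * w^2,
# whose gradient is simply w, using the filtered gradient.
w, ema, lr = 5.0, 0.0, 0.05
for _ in range(500):
    grad = w                                  # df/dw for the toy objective
    g, ema = filtered_grad(grad, ema)
    w -= lr * g

# w approaches the minimum at 0; the slow component dominates here,
# so the effective step along it is roughly (1 + lam) times larger.
print(abs(w))
```

In a real training loop the same filter state would be kept per parameter tensor and the amplified gradient handed to the usual optimizer step; only the filtering and amplification lines are added.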