TL;DR: We show that "emergence" in grokking modular arithmetic occurs in feature-learning kernel machines that use the Average Gradient Outer Product (AGOP), and that the learned features are block-circulant.
Abstract: Neural networks trained to solve modular arithmetic tasks exhibit grokking, a phenomenon where test accuracy starts improving long after the model achieves 100% training accuracy. Grokking is often taken as an example of "emergence", where model ability manifests sharply through a phase transition. In this work, we show that grokking is specific neither to neural networks nor to gradient descent-based optimization. Specifically, we show that this phenomenon occurs when learning modular arithmetic with Recursive Feature Machines (RFM), an iterative algorithm that uses the Average Gradient Outer Product (AGOP) to enable task-specific feature learning with general machine learning models. When used in conjunction with kernel machines, iterating RFM results in a fast transition from near-zero (random-chance) test accuracy to perfect test accuracy. This transition cannot be predicted from the training loss, which is identically zero, nor from the test loss, which remains constant in initial iterations. Instead, as we show, the transition is completely determined by feature learning: RFM gradually learns block-circulant features to solve modular arithmetic. Paralleling the results for RFM, we show that neural networks that solve modular arithmetic also learn block-circulant features. Furthermore, we present theoretical evidence that RFM uses such block-circulant features to implement the Fourier Multiplication Algorithm, which prior work posited as the generalizing solution that neural networks learn on these tasks. Our results demonstrate that emergence can result purely from learning task-relevant features and is specific neither to neural architectures nor to gradient descent-based optimization. Our work also provides further evidence for AGOP as a key mechanism of feature learning in neural networks.
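The RFM/AGOP procedure described in the abstract can be sketched in a few lines. The snippet below is a minimal illustration written from that high-level description, not the paper's released code (see the repository link below for that); the Gaussian kernel with a Mahalanobis metric, the ridge regularizer, the bandwidth, and the one-hot input encoding are illustrative assumptions.

```python
# Minimal sketch of an RFM-style loop: fit a kernel ridge predictor, compute the
# Average Gradient Outer Product (AGOP) of the fitted predictor over the training
# inputs, and use it as the metric for the next iteration.
import numpy as np

def mahalanobis_gaussian_kernel(X, Z, M, bandwidth):
    # k_M(x, z) = exp(-(x - z)^T M (x - z) / (2 * bandwidth^2))
    XM, ZM = X @ M, Z @ M
    d2 = (XM * X).sum(1)[:, None] + (ZM * Z).sum(1)[None, :] - 2.0 * XM @ Z.T
    return np.exp(-np.clip(d2, 0.0, None) / (2.0 * bandwidth ** 2))

def rfm(X, Y, iters=5, reg=1e-3, bandwidth=2.0):
    # X: (n, d) inputs, e.g. concatenated one-hot encodings of (a, b);
    # Y: (n, c) one-hot labels for (a op b) mod p.
    n, d = X.shape
    M = np.eye(d)                                         # start from the Euclidean metric
    for _ in range(iters):
        K = mahalanobis_gaussian_kernel(X, X, M, bandwidth)
        alpha = np.linalg.solve(K + reg * np.eye(n), Y)   # kernel ridge coefficients, (n, c)

        # AGOP: average the outer products of the predictor's input gradients.
        # For this kernel, grad_x f_c(x) = sum_i alpha[i, c] * k_M(x, x_i) * M (x_i - x) / bandwidth^2.
        G = np.zeros((d, d))
        for j in range(n):
            J = (M @ (X - X[j]).T) @ (K[j][:, None] * alpha) / bandwidth ** 2   # (d, c) Jacobian
            G += J @ J.T
        M = G / n                                         # learned feature matrix for the next round
    return M, alpha
```

In the setting described above, the learned feature matrix M is where the block-circulant structure would be expected to appear; the paper's repository should be consulted for the exact kernel, scaling, and normalization choices used in the experiments.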
Lay Summary: We study "grokking", a phenomenon where test accuracy on a task starts improving long after the model achieves perfect training accuracy. Modular arithmetic tasks are a classic example of grokking and have primarily been studied in neural networks; we therefore focus our study on modular addition, subtraction, multiplication, and division. We show that the "emergent" characteristics of the accuracy and loss curves observed for modular arithmetic are exclusive neither to neural networks nor to standard methods for training them, but also appear in kernel machines that recursively learn features without using back-propagation. We train both feature-learning kernel machines and neural networks on modular arithmetic tasks and show that the learned feature structures take the same form, suggesting that both model classes learn a similar set of features. We further show that transforming input data with random features based on this observed feature structure enables both kernel machines and neural networks to immediately learn modular arithmetic tasks without delayed generalization. This suggests that our understanding of the feature learning process for these tasks provides a general and prescriptive way to solve the tasks themselves, agnostic to model class and training method. Finally, we prove theoretically that kernel machines equipped with features based on this observed structure learn the Fourier Multiplication Algorithm, a well-known, generic algorithm for solving modular addition.
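For readers unfamiliar with the Fourier Multiplication Algorithm mentioned above, the snippet below is a small, self-contained demonstration for modular addition, written from the standard description of the algorithm rather than from the paper's code: each input is mapped to Fourier features, frequency-wise products of those features implement angle addition, and the resulting score is maximized exactly at (a + b) mod p.

```python
# Fourier Multiplication Algorithm for modular addition (illustrative demo).
# score(c) = sum_k cos(2*pi*k*(a + b - c)/p), which equals p when c = (a + b) % p
# and 0 otherwise, so taking the argmax recovers the answer exactly.
import numpy as np

def fourier_mult_add(a, b, p):
    k = np.arange(p)
    cos_a, sin_a = np.cos(2 * np.pi * k * a / p), np.sin(2 * np.pi * k * a / p)
    cos_b, sin_b = np.cos(2 * np.pi * k * b / p), np.sin(2 * np.pi * k * b / p)
    # Angle-addition identities: frequency-wise products of the two inputs' features.
    cos_ab = cos_a * cos_b - sin_a * sin_b        # cos(2*pi*k*(a + b)/p)
    sin_ab = sin_a * cos_b + cos_a * sin_b        # sin(2*pi*k*(a + b)/p)
    scores = [(cos_ab * np.cos(2 * np.pi * k * c / p)
               + sin_ab * np.sin(2 * np.pi * k * c / p)).sum() for c in range(p)]
    return int(np.argmax(scores))

p = 17
assert all(fourier_mult_add(a, b, p) == (a + b) % p for a in range(p) for b in range(p))
```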
Link To Code: https://github.com/nmallinar/rfm-grokking/tree/main
Primary Area: General Machine Learning->Representation Learning
Keywords: Theory of deep learning, grokking, modular arithmetic, feature learning, kernel methods, average gradient outer product (AGOP), emergence
Submission Number: 14743