Keywords: mechanistic interpretation, training dynamics, modular addition, feature learning
TL;DR: We demystify the feature learning and training dynamics of gradient-based training on the modular addition task.
Abstract: We present a comprehensive analysis of how two-layer neural networks learn features to solve the modular addition task.
Our work provides a full mechanistic interpretation of the learned model and a theoretical explanation of its training dynamics.
First, we empirically show that trained networks learn a sparse Fourier representation; each neuron's parameters form a trigonometric pattern corresponding to a single frequency.
We identify two key structural properties: phase alignment, where a neuron's output phase is twice its input phase, and model symmetry, where phases are uniformly distributed among neurons sharing the same frequency, particularly when overparametrized.
We prove that these properties allow the network to collectively approximate an indicator function of the correct sum for the modular addition task.
While individual neurons produce noisy signals, the phase symmetry enables a majority-voting scheme that cancels out noise, allowing the network to robustly identify the correct sum.
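To make this construction concrete, here is a minimal NumPy sketch (not the paper's code) that hand-builds such a two-layer network under simplifying assumptions: a quadratic activation, one-hot inputs, a hypothetical modulus p = 23, and four neurons per frequency with uniformly spaced phases. Each neuron uses a single frequency, its output phase is twice its input phase, and summing over a phase group cancels the phase-dependent terms, so the largest logit recovers (a + b) mod p.

```python
import numpy as np

p = 23                      # modulus (hypothetical choice for illustration)
phases_per_freq = 4         # neurons sharing each frequency
freqs = np.arange(1, (p - 1) // 2 + 1)                        # one representative per +/- frequency pair
phis = np.pi * np.arange(phases_per_freq) / phases_per_freq   # uniformly spaced phases

residues = np.arange(p)
theta = 2 * np.pi * residues / p      # angle attached to each residue

logits = np.zeros((p, p, p))          # logits[a, b, c]
for k in freqs:
    for phi in phis:
        w_a = np.cos(k * theta + phi)          # input weights on the one-hot encoding of a
        w_b = np.cos(k * theta + phi)          # input weights on the one-hot encoding of b
        w_out = np.cos(k * theta + 2 * phi)    # output weights: phase is twice the input phase
        pre = w_a[:, None] + w_b[None, :]      # pre-activation for every pair (a, b)
        act = pre ** 2                         # quadratic activation (simplifying assumption)
        logits += act[:, :, None] * w_out[None, None, :]

pred = logits.argmax(axis=2)
truth = (residues[:, None] + residues[None, :]) % p
print("accuracy:", (pred == truth).mean())     # 1.0: the phase group votes out the wrong sums
```

The phase-averaging step is the majority vote: terms that depend on an individual neuron's phase cancel across the uniformly spaced group, leaving only the component aligned with a + b - c.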
We then explain how these features are learned through a "lottery ticket mechanism".
An analysis of the gradient flow reveals that frequencies compete within each neuron during training.
The frequency that ultimately wins is determined predictably by its initial magnitude and phase misalignment.
Finally, we use these insights to demystify grokking, characterizing it as a three-stage process involving memorization followed by two generalization phases driven by feature sparsification.
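These training-dynamics claims suggest a simple diagnostic: project each neuron's input weights onto the Fourier basis and track which frequency dominates over training. The sketch below is a hypothetical helper (not the paper's code) that assumes the first-layer weights act on a one-hot encoding of one operand; it returns each neuron's dominant frequency, that frequency's phase, and the fraction of the weight vector's Fourier energy it carries, quantities one could log at checkpoints to watch the within-neuron frequency competition and the sparsification that accompanies grokking.

```python
import numpy as np

def dominant_frequency(w_in, p):
    """For each hidden neuron, find the Fourier component that dominates its
    input weights over the residues 0..p-1, plus that component's phase.

    w_in: array of shape (num_neurons, p) -- the slice of first-layer weights
          acting on the one-hot encoding of one operand (hypothetical layout).
    Returns (freqs, phases, sparsity), where sparsity is the fraction of the
    weight vector's Fourier energy carried by the dominant frequency.
    """
    spectrum = np.fft.rfft(w_in, axis=1)            # shape (num_neurons, p // 2 + 1)
    power = np.abs(spectrum) ** 2
    power[:, 0] = 0.0                               # ignore the constant (DC) component
    k = power.argmax(axis=1)                        # dominant frequency per neuron
    rows = np.arange(w_in.shape[0])
    phase = np.angle(spectrum[rows, k])             # phase of the dominant component
    sparsity = power[rows, k] / power.sum(axis=1)   # -> 1.0 for a pure single-frequency neuron
    return k, phase, sparsity

# Example on a synthetic single-frequency neuron: frequency 5, phase ~0.7, sparsity ~1.0.
p = 23
w = np.cos(2 * np.pi * 5 * np.arange(p) / p + 0.7)[None, :]
print(dominant_frequency(w, p))
```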
Primary Area: interpretability and explainable AI
Submission Number: 20794