Sparse Fourier Regularization for Modular Arithmetic Models

06 May 2026 (modified: 09 May 2026) · ICML 2026 Workshop CoLoRAI Submission · CC BY 4.0
Keywords: Modular arithmetic, Fourier Regularization
TL;DR: Training modular arithmetic models with $\ell_1$ regularization in Fourier space can bypass grokking and encourage disentangled embeddings.
Abstract: Modular arithmetic serves as a useful testbed for observing empirical phenomena in deep learning, including grokking. Prior work in mechanistic interpretability has shown that sequence models such as transformers and recurrent networks eventually converge to a Fourier multiplication strategy for solving these tasks. In this paper we introduce $\ell_1$ regularization in the Fourier space of the (un)embedding layers to bypass grokking and train modular arithmetic models up to $3\times$ faster. We also study the embedding geometry of models trained on multiple arithmetic operations and show that models trained on multiple operations in the same group (such as addition and subtraction) use the same Fourier spectrum, while models trained on operations across different groups (such as addition and multiplication) entangle their Fourier spectra within the same embedding dimensions, making targeted interventions harder. Here again, $\ell_1$ Fourier regularization applied to groups of embedding dimensions disentangles the Fourier spectra corresponding to different tasks.
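The core regularizer described in the abstract can be sketched as follows. This is a minimal illustration, not the paper's implementation: it assumes a $(p \times d)$ embedding matrix (one row per residue mod $p$), takes the DFT along the token axis, and returns the $\ell_1$ norm of the resulting spectrum. The function name, the NumPy framing, and the optional `groups` argument (for penalizing blocks of embedding dimensions separately, as in the disentanglement setting) are illustrative assumptions.

```python
import numpy as np

def fourier_l1_penalty(embedding, groups=None):
    """L1 norm of the Fourier spectrum of an embedding matrix.

    embedding: (p, d) array; row n is the embedding of residue n mod p.
    groups: optional list of column-index arrays; if given, the penalty
            is summed per group of embedding dimensions (hypothetical
            grouping used to disentangle spectra of different tasks).
    """
    # Real FFT along the token (vocab) axis: shape (p//2 + 1, d), complex.
    spectrum = np.fft.rfft(embedding, axis=0)
    if groups is None:
        return np.abs(spectrum).sum()
    return sum(np.abs(spectrum[:, g]).sum() for g in groups)
```

In training, a term like `lam * fourier_l1_penalty(E)` would be added to the task loss, pushing each embedding dimension toward a sparse set of Fourier frequencies.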
Submission Number: 37