Keywords: grokking, optimisation, linear algebra, SVD, compression
TL;DR: A novel SVD-based training regime that mitigates, and in some cases eliminates, grokking in modular arithmetic, specifically mod 97.
Abstract: Grokking is a delayed transition from memorisation to generalisation in neural networks. It challenges perspectives on efficient learning, particularly in structured tasks and small-data regimes. We study grokking in modular arithmetic, treating it as a training pathology. Using Singular Value Decomposition (SVD), we reparameterise each weight matrix $W$ of a neural network as the product of three matrices, $U$, $\Sigma$ and $V^T$. Through empirical evaluations on the modular addition task, we show that this representation significantly reduces the effect of grokking and, in some cases, eliminates it.
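The abstract describes the reparameterisation but not its implementation. Below is a minimal PyTorch sketch of one way such an SVD-factored layer could look; it is not the authors' code. The class name `SVDLinear`, the choice to initialise the factors from the SVD of a standard dense initialisation, and the decision to train $U$, $\Sigma$ and $V^T$ as free parameters are all assumptions for illustration.

```python
import torch
import torch.nn as nn

class SVDLinear(nn.Module):
    """Linear layer whose weight is stored as W = U @ diag(s) @ Vt.

    Illustrative sketch of the SVD reparameterisation described in the
    abstract; factor shapes and initialisation are assumptions.
    """

    def __init__(self, in_features: int, out_features: int):
        super().__init__()
        # Initialise the factors from the SVD of a standard dense init,
        # then optimise U, s, and Vt directly instead of a single W.
        W0 = torch.empty(out_features, in_features)
        nn.init.kaiming_uniform_(W0)
        U, s, Vt = torch.linalg.svd(W0, full_matrices=False)
        self.U = nn.Parameter(U)    # (out_features, r), r = min(in, out)
        self.s = nn.Parameter(s)    # (r,) singular values
        self.Vt = nn.Parameter(Vt)  # (r, in_features)
        self.bias = nn.Parameter(torch.zeros(out_features))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Reassemble the effective weight from its factors each step.
        W = self.U @ torch.diag(self.s) @ self.Vt
        return x @ W.T + self.bias
```

Under this (assumed) parameterisation, a standard modular-addition network would simply swap each `nn.Linear` for an `SVDLinear` and train as usual; gradients then flow to the factors rather than to a monolithic $W$.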
Code: ipynb
Submission Number: 70