Mechanistic Interpretability on (multi-task) Irreducible Integer Identifiers

18 Sept 2025 (modified: 12 Feb 2026) · ICLR 2026 Conference Desk Rejected Submission · CC BY 4.0
Keywords: Mechanistic Interpretability, Multi-task Learning, Transformer Models, Gradient Dynamics, Modular Arithmetic, Grokking
TL;DR: We replicate prior work on mechanistic interpretability and grokking, and discover a new category of gradient updates that are neither overfitting nor generalizing in multi-task transformer models.
Abstract: Understanding how neural networks internalize algorithms remains a central challenge in deep learning. We investigate mechanistic interpretability in a multitask setting by training transformer models on 29 parallel modular arithmetic tasks, predicting the remainder of two-digit base-113 numbers modulo distinct primes less than 113 (of which there are 29). In addition to replicating prior findings on the dynamics of generalization ("grokking") and overfitting, we discover a third, previously unreported class of gradient updates that is neither overfitting nor part of a final generalized solution. Through comprehensive embedding, singular value decomposition, and Fourier analysis, we observe that these gradient updates are associated with general but temporary structures. Our results reveal how multitask learning shapes the development and interplay of internal mechanisms in deep models, and suggest future directions for speeding up generalization through more targeted gradient filtering in multitask environments. Code: https://anonymous.4open.science/r/miiii-3B9E/
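The task construction described in the abstract can be sketched as follows. This is a hypothetical illustration (all names are our own, and the actual repository may structure the data differently), assuming inputs are digit pairs (a, b) encoding n = 113·a + b, with one remainder label per prime task:

```python
P = 113  # the base, a prime, as in prior grokking work

def primes_below(n):
    """Sieve of Eratosthenes: all primes strictly less than n."""
    sieve = [True] * n
    sieve[0:2] = [False, False]
    for i in range(2, int(n ** 0.5) + 1):
        if sieve[i]:
            sieve[i * i::i] = [False] * len(sieve[i * i::i])
    return [i for i, is_prime in enumerate(sieve) if is_prime]

TASKS = primes_below(P)  # the 29 distinct primes < 113, one task each

def make_dataset():
    """All 113**2 two-digit base-113 inputs, with a label vector of
    29 modular remainders (n mod p for each task prime p)."""
    xs, ys = [], []
    for a in range(P):
        for b in range(P):
            n = P * a + b
            xs.append((a, b))
            ys.append([n % p for p in TASKS])
    return xs, ys
```

Under this reading, every input sequence is shared across all 29 tasks, which is what makes the setting "multi-task": the model must predict all remainders for the same encoded number in parallel.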
Primary Area: interpretability and explainable AI
Submission Number: 11297