Transformers Converge to Invariant Algorithmic Cores

Published: 11 Jun 2026, Last Modified: 11 Jun 2026, Mech Interp Workshop ICML 2026 Virtual poster, Everyone, CC BY 4.0
Keywords: Automated interpretability, Methods (probing, steering, causal interventions), Other
Other Keywords: Conceptual
TL;DR: Transformers converge to low-dimensional subspaces that are necessary, sufficient, and shared across runs. Extracting them recovers ground-truth dynamics, explains grokking via an inverse scaling law, and steers verb agreement across six LLMs.
Abstract: Training selects for behavior, not circuitry: many weight configurations can implement the same function. Studying any single trained neural network thus risks describing accidents of one training run rather than the computation itself. This work shifts focus from what transformers happen to do to what they must do by extracting algorithmic cores, compact subspaces that are necessary and sufficient for a task and that recur across independently trained models. Here, Algorithmic Core Extraction (ACE) is introduced to isolate these subspaces, causally validate them, and recover the algorithms they implement across settings ranging from synthetic tasks to large-scale pretrained models. Markov-chain transformers embed three-dimensional cores in nearly orthogonal subspaces yet recover identical transition spectra. Modular-addition transformers form compact cyclic cores at grokking that later inflate under continued regularization, redundantly distributing the same computation across many functionally equivalent modes. This functional redundancy is found to accelerate the transition from memorization to generalization, yielding an inverse scaling law for grokking time. In six language models spanning two orders of magnitude in scale (GPT-2 Small/Medium/Large, LLaMA-3.1, Gemma-2, and Qwen2.5), subject-verb agreement is governed by a single, steerable axis that aligns across architectures. Flipping this axis inverts grammatical number throughout open-ended generation. Together these results suggest that beneath the apparent complexity of trained transformers lies a simpler, shared computational structure, and that targeting invariants rather than parameterizations may offer a more tractable path to mechanistic understanding and control.
Submission Number: 205
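
One way to picture what "flipping this axis" could mean geometrically is a reflection of hidden activations along a single unit direction, which negates the component along that direction and leaves everything orthogonal to it untouched. The sketch below is only an illustration under that assumption, not the ACE procedure or the authors' steering method: the function name `flip_along_axis`, the random activations `h`, and the direction `v` are all hypothetical stand-ins.

```python
import numpy as np

def flip_along_axis(h, v):
    """Reflect activations h (shape [n, d]) along direction v (shape [d]).

    Hypothetical illustration: the component of each row of h along v is
    negated; the component orthogonal to v is left unchanged.
    """
    v = v / np.linalg.norm(v)            # normalize to a unit direction
    coeff = h @ v                        # projection coefficient per row, shape [n]
    return h - 2.0 * np.outer(coeff, v)  # Householder-style reflection

# Toy data standing in for hidden states and a candidate axis (assumptions).
rng = np.random.default_rng(0)
h = rng.normal(size=(4, 8))
v = rng.normal(size=8)
h_flipped = flip_along_axis(h, v)
```

Because this is a reflection, applying it twice returns the original activations and vector norms are preserved. In a real model the direction would have to come from the extracted subspace rather than from random draws.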