Understanding In-context Learning of Addition via Activation Subspaces

ICLR 2026 Conference Submission 21721 Authors

19 Sept 2025 (modified: 08 Oct 2025), ICLR 2026 Conference Submission, CC BY 4.0
Keywords: In-Context Learning, Mechanistic Interpretability, LLM, Arithmetic Tasks
TL;DR: We localize Llama3's few-shot addition ability to three attention heads acting within interpretable six-dimensional activation subspaces and examine how these heads extract information from the few-shot examples.
Abstract: To perform few-shot learning, language models extract signals from a few input-label pairs, aggregate these into a learned prediction rule, and apply this rule to new inputs. How is this implemented in the forward pass of modern transformer models? To explore this question, we study a structured family of few-shot learning tasks for which the true prediction rule is to add an integer $k$ to the input. We introduce a novel optimization method that localizes the model's few-shot ability to only a few attention heads. We then perform an in-depth analysis of individual heads via dimensionality reduction and decomposition. For example, on Llama3-8B-instruct, we reduce the mechanism on our tasks to just three attention heads with six-dimensional subspaces, where four dimensions track the unit digit with trigonometric functions at periods $2$, $5$, and $10$, and two dimensions track magnitude with low-frequency components. To understand this mechanism more deeply, we derive a mathematical identity relating "aggregation" and "extraction" subspaces for attention heads, allowing us to track the flow of information from individual examples to a final aggregated concept. Using this, we identify a self-correction mechanism in which mistakes learned from earlier demonstrations are suppressed by later demonstrations. Our results demonstrate how tracking low-dimensional subspaces of localized heads across a forward pass can provide insight into fine-grained computational structure in language models.
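As a rough intuition for the geometry the abstract describes, the sketch below shows one way a six-dimensional feature vector could combine periodic unit-digit components (periods 2, 5, and 10) with two slowly varying magnitude components. The function name, the specific scalings, and the choice of magnitude features are illustrative assumptions for building intuition, not the subspace actually recovered in the paper.

```python
import numpy as np

def toy_subspace_features(n: int) -> np.ndarray:
    """Toy 6-dim encoding of an integer n (illustrative assumption only).

    Four dimensions encode the unit digit via trigonometric functions with
    periods 2, 5, and 10; two dimensions encode magnitude with
    low-frequency (slowly varying) components.
    """
    unit = n % 10
    return np.array([
        np.cos(2 * np.pi * unit / 2),   # period-2 component (parity of the unit digit)
        np.cos(2 * np.pi * unit / 5),   # period-5 component
        np.cos(2 * np.pi * unit / 10),  # period-10 components (full unit-digit phase)
        np.sin(2 * np.pi * unit / 10),
        n / 100.0,                      # low-frequency magnitude feature (toy scaling)
        np.log1p(n),                    # second slowly varying magnitude feature
    ])

# Inputs sharing a unit digit (e.g., 7 and 17) agree on the first four
# coordinates and differ only in the two magnitude coordinates.
print(toy_subspace_features(7))
print(toy_subspace_features(17))
```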
Primary Area: interpretability and explainable AI
Submission Number: 21721