Attention Layers Add Into Low-Dimensional Residual Subspaces

14 Sept 2025 (modified: 11 Feb 2026) · Submitted to ICLR 2026 · CC BY 4.0
Keywords: Sparse Autoencoders, Understanding high-level properties of models, Transformer
TL;DR: We find that attention outputs have a low-rank structure, identify it as the root cause of dead features in sparse autoencoders, and propose a fix.
Abstract: Transformer architectures, and their attention mechanisms in particular, form the foundation of modern large language models. While transformer models are widely believed to operate in high-dimensional hidden spaces, we show that attention outputs are confined to a surprisingly low-dimensional subspace, where about 60\% of the directions account for 99\% of the variance. This phenomenon is consistently observed across diverse model families and datasets, and is structurally imposed by the attention output projection matrix. Critically, we identify this low-rank structure as a key factor in the prevalent dead feature problem in sparse dictionary learning: it creates a mismatch between randomly initialized features and the intrinsic geometry of the activation space. Building on this insight, we propose a subspace-constrained training method for sparse autoencoders (SAEs) that initializes feature directions within the active subspace of activations. Our approach reduces dead features from 87\% to below 1\% in Attention Output SAEs with 1M features, and extends further to other sparse dictionary learning methods. Our findings provide both new insights into the geometry of attention and practical tools for improving sparse dictionary learning in large language models. Code is available at \url{https://anonymous.4open.science/r/Language-Model-SAEs-2B1D/}.
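
The sketch below illustrates one plausible reading of the subspace-constrained initialization described in the abstract: estimate the active subspace of attention outputs (the directions covering ~99% of the variance) via SVD, then draw SAE decoder directions inside that subspace. It is not taken from the paper; the function name, its arguments, the variance threshold, and the use of PyTorch are assumptions for illustration only.

```python
import torch

def subspace_constrained_init(acts: torch.Tensor,
                              n_features: int,
                              var_threshold: float = 0.99) -> torch.Tensor:
    """Hypothetical sketch: initialize SAE decoder directions inside the
    active subspace of attention-output activations.

    acts: (n_samples, d_model) attention-output activations.
    Returns a (n_features, d_model) decoder matrix whose rows lie in the span
    of the principal directions explaining `var_threshold` of the variance.
    """
    # Center the activations and compute principal directions via SVD.
    centered = acts - acts.mean(dim=0, keepdim=True)
    _, S, Vh = torch.linalg.svd(centered, full_matrices=False)

    # Keep the smallest set of directions reaching the variance threshold
    # (the abstract reports ~60% of directions cover 99% of the variance).
    var_ratio = (S ** 2) / (S ** 2).sum()
    k = int(torch.searchsorted(var_ratio.cumsum(0),
                               torch.tensor(var_threshold))) + 1
    basis = Vh[:k]  # (k, d_model) orthonormal basis of the active subspace

    # Draw random feature directions inside the subspace and unit-normalize.
    coeffs = torch.randn(n_features, k)
    W_dec = coeffs @ basis
    return W_dec / W_dec.norm(dim=1, keepdim=True)
```

In this reading, the encoder could be initialized as the transpose of `W_dec`, so that both feature reading and writing directions start aligned with the activations' intrinsic geometry rather than with isotropic random vectors.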
Primary Area: interpretability and explainable AI
Submission Number: 5077