Beyond Components: Singular Vector-Based Interpretability of Transformer Circuits

Published: 18 Sept 2025 · Last Modified: 29 Oct 2025 · NeurIPS 2025 poster · CC BY-NC-SA 4.0
Keywords: LLMs, Transformer Circuits, Mechanistic Interpretability
TL;DR: We propose a new method for interpreting transformer circuits by performing SVD on the query-key and value-output matrices.
Abstract: Transformer-based language models exhibit complex behavior, but their internal computations remain poorly understood. Most mechanistic interpretability approaches treat components, such as attention heads and MLPs, as atomic units, ignoring potential functional substructure. We propose a finer-grained perspective that models components as superpositions of orthogonal singular directions. This perspective allows multiple independent computations to coexist within a single head or MLP, enabling selective intervention, attribution, and interpretation at a finer granularity than previous methods allow. We demonstrate this approach on the Indirect Object Identification (IOI) task, showing that well-known functional heads, like the “name mover,” encode overlapping subfunctions aligned with distinct singular directions. Nodes previously identified as part of circuits exhibit strong engagement along specific directions, supporting the view that meaningful computations are embedded in low-rank subspaces. While some functional axes remain difficult to interpret, our results reveal that transformer components are more distributed, compact, and compositional than previously assumed. This opens a new direction for fine-grained mechanistic interpretability and the study of model behavior.
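
The abstract's central operation, decomposing a head's effective weight matrices into orthogonal singular directions so individual directions can be attributed or ablated, can be illustrated with a short sketch. This is a minimal illustration, not the authors' code: the random stand-in weights and the names d_model, d_head, W_V, and W_O are assumptions for demonstration, not details taken from the paper.

import numpy as np

rng = np.random.default_rng(0)
d_model, d_head, n_tokens = 64, 16, 8

# Stand-ins for a trained head's value and output projections.
W_V = rng.standard_normal((d_model, d_head)) / np.sqrt(d_model)
W_O = rng.standard_normal((d_head, d_model)) / np.sqrt(d_head)

# The head's effective value-output map is the low-rank product W_V W_O.
W_VO = W_V @ W_O  # (d_model, d_model), rank <= d_head

# SVD writes W_VO as a superposition of orthogonal rank-one directions:
# W_VO = sum_i S[i] * outer(U[:, i], Vt[i]).
U, S, Vt = np.linalg.svd(W_VO)

# Residual-stream inputs to the head (illustrative random activations).
x = rng.standard_normal((n_tokens, d_model))

# Per-direction contributions to the head's output; each term is rank one.
contribs = [S[i] * (x @ U[:, i])[:, None] * Vt[i][None, :]
            for i in range(d_head)]

# Sanity check: the rank-one directions sum back to the full head output.
full_out = x @ W_VO
assert np.allclose(sum(contribs), full_out, atol=1e-8)

# Selective intervention: ablate a single singular direction.
ablated = full_out - contribs[0]

Applying the same decomposition to the query-key product (W_Q W_K^T) would, by analogy, yield directions along which attention patterns can be attributed.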
Primary Area: Deep learning (e.g., architectures, generative models, optimization for deep networks, foundation models, LLMs)
Submission Number: 19342