Beyond Components: Singular Vector-Based Interpretability of Transformer Circuits

Published: 18 Sept 2025 · Last Modified: 11 Dec 2025 · NeurIPS 2025 poster · CC BY-NC-SA 4.0
Keywords: LLMs, Transformer Circuits, Mechanistic Interpretability
TL;DR: We propose a new method for interpreting transformer circuits by performing SVD on the query-value and value-output matrices
Abstract: Transformer-based language models exhibit complex and distributed behavior, yet their internal computations remain poorly understood. Existing mechanistic interpretability methods typically treat attention heads and multilayer perceptron (MLP) layers (the building blocks of a transformer architecture) as indivisible units, overlooking the possibility of functional substructure learned within them. In this work, we introduce a more fine-grained perspective that decomposes these components into orthogonal singular directions, revealing superposed and independent computations within a single head or MLP. We validate our perspective on widely used standard tasks such as Indirect Object Identification (IOI), Gender Pronoun (GP), and Greater Than (GT), showing that previously identified canonical functional heads, such as the “name mover,” encode multiple overlapping subfunctions aligned with distinct singular directions. Nodes in the computational graph that were previously identified as circuit elements show strong activation along specific low-rank directions, suggesting that meaningful computations reside in compact subspaces. While some directions remain challenging to interpret fully, our results highlight that transformer computations are more distributed, structured, and compositional than previously assumed. This perspective opens new avenues for fine-grained mechanistic interpretability and a deeper understanding of model internals.
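To make the core idea concrete, below is a minimal sketch of the kind of decomposition the abstract describes: taking SVD of a single attention head's value-output (OV) matrix and splitting it into orthogonal rank-1 directions. The weight matrices `W_V` and `W_O`, the dimensions, and the top-k inspection loop are illustrative assumptions, not the paper's actual code; in practice the projections would be loaded from a trained model.

```python
# Minimal sketch: decomposing one attention head's OV matrix into
# orthogonal singular directions (assumed setup, not the paper's code).
import torch

d_model, d_head = 768, 64
W_V = torch.randn(d_model, d_head)  # placeholder value projection; load from a real model in practice
W_O = torch.randn(d_head, d_model)  # placeholder output projection

# Full OV circuit of the head: shape (d_model, d_model), rank at most d_head.
W_OV = W_V @ W_O
U, S, Vh = torch.linalg.svd(W_OV, full_matrices=False)

# Each rank-1 term S[i] * U[:, i] Vh[i, :] is one singular direction;
# the head's computation is the sum of these independent components.
k = 4  # inspect the top-k directions
for i in range(k):
    direction = S[i] * torch.outer(U[:, i], Vh[i, :])
    print(f"direction {i}: singular value {S[i]:.3f}, component norm {direction.norm():.3f}")
```

Under this view, interpreting a head reduces to interpreting the few directions with large singular values, since the remaining components contribute little to the head's output.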
Primary Area: Deep learning (e.g., architectures, generative models, optimization for deep networks, foundation models, LLMs)
Submission Number: 19342