Keywords: Circuit Analysis, Attribution Graphs, Methods (probing, steering, causal interventions), Feature Geometry
TL;DR: What makes an attention head special is not the head itself, but whether its OV-routed contribution aligns with an invariant computation; in factored settings, the true computational unit is often the aggregate write into a residual-stream subspace.
Abstract: Mechanistic analyses of transformers often proceed head-by-head, looking for attention maps that correspond to computational roles. But when should an attention head be interpreted as a computational unit? In principle, multi-head attention need not organize computation in such a clean way. A single computational function may be distributed across heads, and a single head may participate in several computations, with invariants appearing only after OV routing and residual-stream summation. In natural language models, this is hard to resolve because the underlying features and computations are unknown, and an apparently uninterpretable head may be polysemantic, part of a distributed circuit, or a sign that the analyst chose the wrong decomposition. To get a handle on this issue, we study a controlled setting where the target computation is analytically specified: transformers trained on factored sequence processes. These processes require independent predictive updates for multiple latent factors, represented in orthogonal residual-stream subspaces. This lets us ask whether the attention circuits implementing these independent updates are themselves factorized. We find that they need not be. Per-factor updates constrain only the aggregate routed contribution across heads, leaving substantial freedom in how computation is distributed. Heads may specialize to individual factors, compose with other heads to implement one factor, or contribute polysemantically across factor boundaries. The regime that emerges depends on the generator's spectrum, the head budget, and training dynamics. We introduce \emph{effective subspace attention}, a scalar quantity that combines attention patterns with OV routing to recover invariant subspace-level contributions when individual maps are illegible. 
Our results show that independent computations can be implemented by shared attention circuits, and that individual heads may look uninterpretable even when the collective routed circuit implements a precise, theoretically predicted computation.
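The claim that per-factor updates constrain only the aggregate routed contribution can be illustrated with a toy numerical sketch. This is not the paper's code or its definition of effective subspace attention. The dimensions, the projector `P` onto a hypothetical factor subspace, the shared attention pattern `A`, and the helper `subspace_write` are all assumptions made for illustration. Two different two-head configurations are built so that their summed OV-routed writes into the subspace coincide even though the individual heads differ.

```python
import numpy as np

rng = np.random.default_rng(0)
T, d, k = 5, 8, 3  # illustrative sequence length, model width, subspace dimension

X = rng.normal(size=(T, d))  # stand-in residual-stream inputs

# Hypothetical factor subspace: the first k residual-stream coordinates.
P = np.zeros((d, d))
P[:k, :k] = np.eye(k)


def causal_attention(scores):
    """Row-wise softmax over a causally masked score matrix."""
    mask = np.tril(np.ones_like(scores, dtype=bool))
    scores = np.where(mask, scores, -np.inf)
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    return weights / weights.sum(axis=-1, keepdims=True)


def subspace_write(heads):
    """Sum of per-head OV-routed outputs, projected onto the subspace."""
    return sum(A_h @ X @ W_h for A_h, W_h in heads) @ P


A = causal_attention(rng.normal(size=(T, T)))  # attention pattern shared by both heads
W_target = rng.normal(size=(d, d))             # OV map that realizes the target write

# Configuration A: one head carries the whole update, the other writes nothing.
heads_a = [(A, W_target), (A, np.zeros((d, d)))]

# Configuration B: the same update is split arbitrarily across two heads.
W_split = rng.normal(size=(d, d))
heads_b = [(A, W_split), (A, W_target - W_split)]

write_a = subspace_write(heads_a)
write_b = subspace_write(heads_b)

# First head's individual contribution in each configuration.
head0_a = A @ X @ W_target @ P
head0_b = A @ X @ W_split @ P

print("aggregate writes match:", np.allclose(write_a, write_b))
print("first-head writes match:", np.allclose(head0_a, head0_b))
```

In this sketch the split in configuration B is arbitrary, so inspecting either head alone says little about the update, while the summed write into the subspace is identical across configurations.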
Submission Number: 528