Component and Dimension Sparsity in Transformer Refusal Mechanisms
Keywords: interpretability, activation steering, components, dimensions, sparsity
Abstract: Activation steering manipulates large language model behavior by intervening on internal activations, but the mechanistic basis of these interventions remains poorly understood. We decompose refusal steering into component-level interventions across four open-weight models, identifying which attention and MLP components can reproduce the full steering effect. We find that the refusal direction is concentrated in sparse mechanisms that comprise 28–48% of upstream components yet retain 88–101% of full steering effectiveness, with complement experiments confirming that excluded components largely fail to elicit refusal when steered in isolation. Within these mechanisms, effective steering further concentrates in approximately 50% of residual stream dimensions, consistent with a privileged basis structure, and this coordinate-level sparsity persists when steering is restricted to the identified component set. Decomposing each component's contribution to the refusal direction reveals that attention sublayers provide the dominant discriminative signal between harmful and harmless content, while MLP contributions are more distributed and in some cases suppressive. The high-magnitude writes of individual components do not concentrate in the same coordinate subspace as the refusal direction, suggesting that the refusal direction basis emerges as an aggregate of component-specific sparse signals rather than a shared subspace. Together, these findings show that the refusal direction is not diffusely encoded across a transformer but assembled by a structured, identifiable mechanism, providing a foundation for mechanistic understanding of how safety-relevant behaviors are represented and steered.
Track: Regular Paper (9 pages)
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 209