Keywords: Mechanistic Interpretability, Activation Steering, Attribution Graph, Large Language Models, Efficient Intervention
Abstract: Activation steering has emerged as an effective and economical technique for controlling the behavior of large language models (LLMs). Despite growing interest, existing methods typically rely on exhaustive layer-wise intervention to identify effective steering locations, a process that is computationally expensive and lacks interpretability. To address this problem, we propose Attribution-to-Action (A2A), an efficient layer-selection framework that leverages mechanistic interpretability to identify the layers where steering is most impactful. Specifically, A2A first constructs an attribution graph that traces how internal pathways contribute to model outputs, guided by a small set of contrastive behavior data. Edge-level attribution weights are then aggregated at the node level and combined within each layer to derive a layer-importance ranking. Steering vectors are applied only to the top-ranked layers, substantially reducing the search space. Experiments on behavior-control tasks such as personality conditioning and model jailbreaking demonstrate that A2A matches the performance of exhaustive search while requiring significantly fewer interventions and offering improved interpretability.
Primary Area: interpretability and explainable AI
Submission Number: 24860
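The layer-selection procedure described in the abstract (aggregate edge-level attribution weights into node scores, combine node scores within each layer, then steer only the top-ranked layers) can be sketched in a few lines. The snippet below is a minimal illustration under assumed interfaces, not the authors' implementation: the edge representation, the helper names `rank_layers_by_attribution` and `apply_steering`, and the parameters `top_k` and `alpha` are all hypothetical choices for the sake of the example.

```python
import torch

def rank_layers_by_attribution(edges, num_layers, top_k=3):
    """Rank layers by aggregated attribution weight (illustrative sketch).

    edges: iterable of (src, dst, weight) tuples from an attribution
           graph, where src and dst are (layer_idx, feature_idx) nodes.
    Returns the indices of the top_k most important layers.
    """
    # Step 1: aggregate edge-level attribution weights at the node level.
    node_scores = {}
    for src, dst, weight in edges:
        node_scores[src] = node_scores.get(src, 0.0) + abs(weight)
        node_scores[dst] = node_scores.get(dst, 0.0) + abs(weight)

    # Step 2: combine node scores within each layer to get a
    # per-layer importance score.
    layer_scores = torch.zeros(num_layers)
    for (layer_idx, _feature_idx), score in node_scores.items():
        layer_scores[layer_idx] += score

    # Step 3: keep only the top-ranked layers as steering targets.
    return torch.topk(layer_scores, k=top_k).indices.tolist()

def apply_steering(hidden_state, steering_vector, alpha=4.0):
    # Add a scaled steering vector to a layer's residual-stream
    # activation; in practice this would be attached via a forward hook.
    return hidden_state + alpha * steering_vector
```

In use, one would register `apply_steering` as a forward hook only on the layers returned by `rank_layers_by_attribution`, rather than sweeping every layer of the model as in exhaustive search.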