Keywords: Mechanistic Interpretability, Activation Steering, Attribution Graph, Large Language Models, Efficient Intervention
Abstract: Activation steering has emerged as an effective and economical technique for controlling the behavior of large language models (LLMs). Despite growing interest, existing methods typically rely on exhaustive layer-wise intervention to identify effective steering locations, a process that is computationally expensive and lacks interpretability. To address this problem, we propose Attribution-to-Action (A2A), an efficient layer-selection framework that leverages mechanistic interpretability to identify the layers where steering is most impactful. Specifically, A2A first constructs an attribution graph that traces how internal pathways contribute to model outputs, guided by a small set of contrastive behavior data. Edge-level attribution weights are then aggregated at the node level and combined within each layer to derive a layer-importance ranking. Steering vectors are applied only to the top-ranked layers, substantially reducing the search space. Experiments on behavior-control tasks such as personality conditioning and model jailbreaking demonstrate that A2A matches the performance of exhaustive search while requiring significantly fewer interventions and offering improved interpretability.
Primary Area: interpretability and explainable AI
Submission Number: 24860
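The layer-selection procedure described in the abstract (aggregate edge-level attribution weights into node scores, combine node scores within each layer, then steer only the top-ranked layers) can be sketched in a few lines. The snippet below is a minimal illustration under assumed interfaces, not the authors' implementation: the edge representation, the helper names `rank_layers_by_attribution` and `apply_steering`, and the parameters `top_k` and `alpha` are all hypothetical choices for the sake of the example.

```python
import torch

def rank_layers_by_attribution(edges, num_layers, top_k=3):
    """Rank layers by aggregated attribution weight (illustrative sketch).

    edges: iterable of (src, dst, weight) tuples from an attribution
           graph, where src and dst are (layer_idx, feature_idx) nodes.
    Returns the indices of the top_k most important layers.
    """
    # Step 1: aggregate edge-level attribution weights at the node level.
    node_scores = {}
    for src, dst, weight in edges:
        node_scores[src] = node_scores.get(src, 0.0) + abs(weight)
        node_scores[dst] = node_scores.get(dst, 0.0) + abs(weight)

    # Step 2: combine node scores within each layer to get a
    # per-layer importance score.
    layer_scores = torch.zeros(num_layers)
    for (layer_idx, _feature_idx), score in node_scores.items():
        layer_scores[layer_idx] += score

    # Step 3: keep only the top-ranked layers as steering targets.
    return torch.topk(layer_scores, k=top_k).indices.tolist()

def apply_steering(hidden_state, steering_vector, alpha=4.0):
    # Add a scaled steering vector to a layer's residual-stream
    # activation; in practice this would be attached via a forward hook.
    return hidden_state + alpha * steering_vector
```

In use, one would register `apply_steering` as a forward hook only on the layers returned by `rank_layers_by_attribution`, rather than sweeping every layer of the model as in exhaustive search.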