Instance-Semantic Attribution for Activation Steering in Large Language Models

ACL ARR 2026 January Submission 8564 Authors

06 Jan 2026 (modified: 20 Mar 2026), ACL ARR 2026 January Submission, CC BY 4.0
Keywords: Activation Steering, Large Language Model, Behavior Steering, Behavior Alignment
Abstract: Activation steering guides large language models (LLMs) toward a target behavior by injecting steering vectors into the residual stream during inference. These vectors are typically derived from behavioral instances, each consisting of a prompt paired with a positive and a negative response. However, we observe that not all instances contribute equally: some yield strong and consistent steering signals, while others are noisy, unstable, or even misleading, so treating all instances uniformly can degrade overall steering performance. To address this issue, we propose Semantic-Aware Activation Steering (SAAS), a framework that performs instance-semantic attribution to weight each instance-level steering vector by its relevance to the target behavior. SAAS first assesses instance eligibility by measuring activation separability and directional consistency, then performs instance selection to retain only stable, behavior-relevant signals. Next, it uses LLM-based agents to decompose the target behavior into fine-grained semantic sub-behaviors and to assess each instance’s alignment with them. Finally, the global steering vector is obtained as a weighted aggregation of the instance-level vectors, with weights determined by these semantic alignment scores. Experiments across diverse behavior steering tasks show that SAAS consistently improves steering effectiveness and stability compared to existing activation steering methods.
Paper Type: Long
Research Area: Safety and Alignment in LLMs
Research Area Keywords: safety and alignment
Contribution Types: NLP engineering experiment
Languages Studied: English
Submission Number: 8564
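
The pipeline outlined in the abstract (instance selection, then score-weighted aggregation, then injection) can be illustrated with a minimal sketch. This is not the authors' implementation. The function names, the cosine-to-mean consistency filter, the `min_cos` threshold, the normalization of scores into weights, and the scale `alpha` are all assumptions made for illustration. The semantic alignment scores, which the abstract says come from LLM-based agents, are simply passed in as an array, and the activation separability check is omitted.

```python
import numpy as np

def global_steering_vector(pos_acts, neg_acts, scores, min_cos=0.0):
    """Hypothetical sketch: filter instances, then aggregate by score.

    pos_acts, neg_acts: (n_instances, hidden_dim) activations for the
    positive and negative responses at one layer.
    scores: (n_instances,) non-negative semantic alignment scores
    (assumed to be supplied externally, e.g. by an LLM-based judge).
    """
    pos_acts = np.asarray(pos_acts, dtype=float)
    neg_acts = np.asarray(neg_acts, dtype=float)
    scores = np.asarray(scores, dtype=float)

    # Instance-level steering vectors: positive minus negative activation.
    vectors = pos_acts - neg_acts

    # Assumed stand-in for directional consistency: keep instances whose
    # vector points in roughly the same direction as the mean vector.
    mean_dir = vectors.mean(axis=0)
    mean_dir = mean_dir / (np.linalg.norm(mean_dir) + 1e-8)
    norms = np.linalg.norm(vectors, axis=1) + 1e-8
    keep = (vectors @ mean_dir) / norms > min_cos

    kept_scores = scores[keep]
    if kept_scores.sum() <= 0:
        raise ValueError("no eligible instance with a positive score")

    # Weighted aggregation: weights are the normalized alignment scores.
    weights = kept_scores / kept_scores.sum()
    return weights @ vectors[keep]

def steer(hidden, vector, alpha=1.0):
    """Add the scaled steering vector to residual-stream activations."""
    return np.asarray(hidden, dtype=float) + alpha * vector
```

In this sketch an instance whose difference vector opposes the mean direction is dropped before aggregation, so a high alignment score cannot reintroduce it. Whether SAAS orders or combines the two stages this way is not stated in the abstract.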