Angular Steering: Behavior Control via Rotation in Activation Space

Published: 10 Jun 2025, Last Modified: 30 Jun 2025, MoFA Poster, CC BY 4.0
Keywords: LLMs, Mechanistic Interpretability, Activation Steering, Safety Alignment
TL;DR: This paper introduces Angular Steering, a robust and generalized method for fine-grained behavior control in language models, unifying and extending existing steering techniques through rotation in a feature-isolating subspace.
Abstract: Controlling specific behaviors in large language models while preserving general capabilities remains a key challenge for safe AI deployment. Current steering methods like vector addition and directional ablation are limited to two-dimensional subspaces, making them parameter-sensitive and prone to affecting unrelated features. We introduce Angular Steering, which modulates behavior by rotating activations within a fixed subspace, providing fine-grained control over behaviors like refusal and compliance. This geometric rotation framework generalizes existing techniques while simplifying parameter selection and maintaining model stability. Experiments demonstrate that Angular Steering achieves robust behavioral control with comparable language modeling performance across multiple model families. Our Adaptive Angular Steering variant further enhances stability by selectively rotating only aligned activations.
Submission Number: 76
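
The abstract describes steering as a rotation of activations within a fixed subspace. The following is a minimal NumPy sketch of that geometric idea, not the authors' implementation. It assumes the subspace is a 2D plane spanned by two given directions, and that steering rotates the in-plane component of an activation by an angle `theta` while leaving the orthogonal remainder untouched. The function names (`orthonormal_plane`, `rotate_in_plane`), the random directions, and the hidden size of 8 are all hypothetical and chosen only for illustration.

```python
import numpy as np

def orthonormal_plane(d1, d2):
    """Gram-Schmidt: build an orthonormal basis (b1, b2) for span{d1, d2}."""
    b1 = d1 / np.linalg.norm(d1)
    residual = d2 - (d2 @ b1) * b1          # remove the b1 component of d2
    b2 = residual / np.linalg.norm(residual)
    return b1, b2

def rotate_in_plane(h, b1, b2, theta):
    """Rotate the component of h lying in span{b1, b2} by theta radians.

    The part of h orthogonal to the plane is left unchanged, so the
    overall norm of h is preserved.
    """
    c1, c2 = h @ b1, h @ b2                 # in-plane coordinates of h
    in_plane = c1 * b1 + c2 * b2
    r1 = np.cos(theta) * c1 - np.sin(theta) * c2
    r2 = np.sin(theta) * c1 + np.cos(theta) * c2
    return h - in_plane + r1 * b1 + r2 * b2

# Illustrative (assumed) inputs: random stand-ins for a feature direction,
# a second direction defining the plane, and a single activation vector.
rng = np.random.default_rng(0)
feature_dir = rng.normal(size=8)
second_dir = rng.normal(size=8)
b1, b2 = orthonormal_plane(feature_dir, second_dir)

h = rng.normal(size=8)
h_rot = rotate_in_plane(h, b1, b2, np.pi / 2)
```

Because only the in-plane coordinates change, the rotation preserves the activation norm, and a single angle is the only steering parameter in this sketch. How the plane is chosen and how the angle is selected per layer are left open here; the paper itself is the reference for those details.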