Keywords: activation steering, momentum, alignment, behavioral control, mechanistic interpretability, representation engineering, optimization
TL;DR: We propose Momentum Steering, a momentum-based framework for activation steering in LLMs.
Abstract: Activation steering has emerged as a powerful approach for controlling large language models (LLMs), with prominent methods such as ActAdd, Directional Ablation, and Angular Steering relying on difference-in-means activations from contrastive prompts across layers. These differences are typically treated as candidate feature directions, later refined into optimal steering vectors or planes. In this work, we reinterpret these candidate directions as gradients of an underlying optimization problem. Building on this perspective, we propose Momentum Steering, a momentum-based framework for activation steering in LLMs. Unlike traditional difference-in-means methods, our framework generates a richer family of candidate directions through momentum updates, enabling more expressive steering. We first introduce a non-causal variant that accumulates difference-in-means signals via momentum, producing enhanced candidate directions. We then develop a causal variant, where future layer statistics are recursively influenced by previously applied momentum directions, explicitly modeling the causal effects of interventions on downstream activations. This recursive formulation yields more stable and consistent steering dynamics. Momentum Steering is lightweight and modular, making it easily compatible with state-of-the-art steering methods. We empirically demonstrate that Momentum Steering delivers consistently stronger, more robust, and more reliable behavioral control than existing approaches across diverse LLM families and benchmarks.
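The non-causal variant described in the abstract can be illustrated with a short sketch: per-layer difference-in-means vectors are treated as gradient-like signals and smoothed with a momentum update across layers. All names, shapes, and the momentum coefficient below are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def momentum_steering_directions(pos_acts, neg_acts, beta=0.9):
    """Sketch of non-causal momentum accumulation of steering directions.

    pos_acts, neg_acts: activations from contrastive prompt sets,
        shape (num_layers, num_prompts, d_model)  [assumed layout]
    beta: momentum coefficient (hypothetical default)
    Returns an array of shape (num_layers, d_model): one momentum-smoothed
    candidate direction per layer.
    """
    # Difference-in-means per layer, interpreted as a gradient-like signal.
    diffs = pos_acts.mean(axis=1) - neg_acts.mean(axis=1)
    directions = np.zeros_like(diffs)
    v = np.zeros(diffs.shape[1])
    for layer, d in enumerate(diffs):
        # Momentum update accumulates signals from earlier layers.
        v = beta * v + (1.0 - beta) * d
        directions[layer] = v
    return directions
```

The causal variant would differ in that activations at later layers are recomputed after each intervention, so `diffs` at layer `l` depends on the directions already applied at layers below; that feedback loop is not captured in this stateless sketch.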
Supplementary Material: zip
Primary Area: optimization
Submission Number: 4447