Towards Steering without Sacrifice: Principled Training of Steering Vectors for Prompt-only Interventions
TL;DR: We introduce the Prompt-Only Steering Vector (PrOSV). It achieves SOTA steering performance without post-hoc factor selection; compared to full-sequence steering vectors (FSSVs), it yields a better tradeoff between model utility and adversarial robustness.
Abstract: Recently, *steering vectors (SVs)* have emerged as an effective and lightweight approach to steer behaviors of large language models (LLMs), among which fine-tuned SVs are more effective than optimization-free ones.
However, current approaches to fine-tuned SVs suffer from two limitations.
First, they require careful selection of steering factors on a per-SV basis to balance steering effectiveness and generation quality at inference time.
Second, they operate as *full-sequence SVs (FSSVs)*, which can sacrifice generation quality regardless of factor selection due to excessive intervention on the model generation process.
To address the first limitation, we propose *joint training* of steering factors and directions, such that post-hoc factor selection is no longer required.
Using neural network scaling theory, we find that moderately large initialization sizes and learning rates for steering factors are essential for stability and efficiency of joint training.
To tackle the second limitation, we draw inspiration from *representation fine-tuning* and introduce **Prompt-Only Steering Vector (PrOSV)**, an SV that intervenes only on a few prompt tokens.
Our empirical results show that PrOSV outperforms traditional FSSVs on AxBench when using our joint training scheme.
We also find that PrOSV achieves a better tradeoff between general model utility and adversarial robustness than FSSV.
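To make the distinction between full-sequence and prompt-only intervention concrete, here is a minimal NumPy sketch. It is not the paper's implementation: the function name `apply_steering`, the additive form `hidden + factor * direction`, and the choice to intervene on the last `k` prompt positions are all assumptions made for illustration; the abstract only states that PrOSV intervenes on "a few prompt tokens".

```python
import numpy as np

def apply_steering(hidden, direction, factor, prompt_len, num_prompt_positions=None):
    """Add factor * direction to selected rows of a (seq_len, d_model) array.

    Hypothetical helper for illustration only.
    num_prompt_positions=None -> full-sequence style: every position is steered.
    num_prompt_positions=k    -> prompt-only style: only the last k prompt
                                 positions are steered (an assumed choice).
    Returns the steered copy and the boolean mask of steered positions.
    """
    seq_len = hidden.shape[0]
    mask = np.zeros(seq_len, dtype=bool)
    if num_prompt_positions is None:
        mask[:] = True
    else:
        k = min(num_prompt_positions, prompt_len)
        mask[prompt_len - k:prompt_len] = True
    steered = hidden.copy()
    steered[mask] += factor * direction  # broadcast over the selected rows
    return steered, mask

# Toy setup: 4 prompt positions followed by 2 generated positions, d_model = 4.
hidden = np.zeros((6, 4))
direction = np.ones(4)   # stands in for a trained steering direction
factor = 2.0             # stands in for a jointly trained steering factor

full_steered, full_mask = apply_steering(hidden, direction, factor, prompt_len=4)
prompt_steered, prompt_mask = apply_steering(
    hidden, direction, factor, prompt_len=4, num_prompt_positions=2
)
print("full-sequence steered positions:", int(full_mask.sum()))
print("prompt-only steered positions:", int(prompt_mask.sum()))
```

In this toy setup the full-sequence variant alters all six positions, including the generated ones, while the prompt-only variant leaves generated positions untouched. In a real model, steering would propagate from the prompt positions to later tokens through attention, which this sketch does not model.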
Lay Summary: Large language models can now follow instructions, answer questions, and generate text with impressive fluency. However, controlling their behavior remains difficult: common methods either rely on fragile prompts or require expensive retraining of the entire model. A promising alternative is "steering vectors," which modify the model’s internal representations during generation. Unfortunately, existing steering vector methods often require extensive manual tuning and can noticeably harm the model's general abilities by intervening on every generated token.
In this work, we develop a more principled way to train steering vectors. Using mathematical tools from neural network scaling theory, we derive practical rules for how to initialize and optimize steering vectors so that they can be trained reliably without costly trial-and-error tuning. We also introduce Prompt-Only Steering Vectors (PrOSV), which steer the model by modifying only a few input tokens instead of intervening throughout the entire generation process.
Our experiments show that PrOSV achieves stronger and more reliable control of language models while better preserving their reasoning and instruction-following abilities. We hope this work contributes to safer, more efficient, and more interpretable methods for controlling AI systems.
Originally Submitted Supplementary Material: zip
Link To Code: https://github.com/ZJU-OmniAI/prosv
Primary Area: Social Aspects->Accountability, Transparency, and Interpretability
Keywords: steering vector, representation steering, activation steering, scaling theory
Originally Submitted PDF: pdf
Submission Number: 2655