Abstract: We introduce STEERINGSAFETY, a benchmark for evaluating representation steering methods across nine safety perspectives spanning 18 datasets. While prior work highlights the general capabilities of representation steering, we focus on safety perspectives including refusal, bias, hallucination, social behaviors, reasoning, epistemic integrity, and normative judgment. STEERINGSAFETY provides modularized building blocks for state-of-the-art steering methods, enabling unified implementation of DIM, ACE, CAA, PCA, and LAT with recent enhancements such as conditional steering. Results on Gemma-2-2B, Llama-3.1-8B, and Qwen-2.5-7B show that strong steering performance depends on the pairing of method, model, and specific perspective. For instance, DIM is consistently effective, yet all methods exhibit substantial entanglement, where improving effectiveness on one safety perspective often significantly changes performance on others. Social behaviors are most vulnerable (degradation up to 76%), refusal steering (jailbreaking) frequently compromises normative judgment such as commonsense morality (up to 26%), and hallucination steering shifts political views unpredictably across models, ranging from a 21% shift to the right to a 19% shift to the left. These findings show the need to understand steering methods through multiple safety angles rather than a single target behavior.
Lay Summary: Current methods for steering LLMs, i.e., modifying LLM behavior by changing how strongly a vector is expressed in the model's internal representations, are not evaluated uniformly across safety-relevant benchmarks. Even when they are evaluated, the evaluation often focuses on just one behavior at a time, e.g., decreasing bias. What is often ignored is that modifying how a model exhibits one behavior can cause large spillovers that affect other behaviors as well.
To address this, we create a standardized evaluation setup covering five common steering methods and nine safety perspectives. We then benchmark not only how effectively a steering method modifies a behavior, but also how much that modification affects other behaviors.
Our findings suggest that spillover is substantial and follows no clear pattern: the method, the model, and the behavior all factor into the results. Much room remains both to understand why this occurs and to build methods that are more easily controllable and have fewer side effects.
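To make the notions of steering and spillover concrete, the following is a minimal sketch, not the benchmark's implementation. It assumes DIM refers to a difference-in-means direction, uses synthetic NumPy vectors in place of real model activations, and all names (`difference_in_means`, `steer`, `projection`, `other_axis`) are hypothetical. It adds a steering vector to toy activations and measures how much a second, partially overlapping direction moves as a side effect.

```python
import numpy as np


def difference_in_means(pos_acts, neg_acts):
    """Unit-norm direction from mean(positive) - mean(negative) activations."""
    direction = pos_acts.mean(axis=0) - neg_acts.mean(axis=0)
    return direction / np.linalg.norm(direction)


def steer(acts, direction, alpha):
    """Shift every activation vector by alpha along the steering direction."""
    return acts + alpha * direction


def projection(acts, direction):
    """Mean scalar projection of the activations onto a unit direction."""
    return float((acts @ direction).mean())


rng = np.random.default_rng(0)
dim = 16

# Assumption: the target behavior lives along axis 0 of a toy 16-d space.
target_axis = np.zeros(dim)
target_axis[0] = 1.0

# Assumption: a second behavior is read out along a unit direction that
# partially overlaps with the target axis (this overlap models entanglement).
other_axis = np.zeros(dim)
other_axis[0] = 0.6
other_axis[1] = 0.8

# Synthetic contrastive "activations" for the target behavior.
pos = rng.normal(size=(200, dim)) + 2.0 * target_axis
neg = rng.normal(size=(200, dim)) - 2.0 * target_axis

direction = difference_in_means(pos, neg)

# Steer held-out toy activations and compare projections before and after.
test_acts = rng.normal(size=(100, dim))
steered = steer(test_acts, direction, alpha=4.0)

target_shift = projection(steered, direction) - projection(test_acts, direction)
spillover = projection(steered, other_axis) - projection(test_acts, other_axis)

print(f"shift along steering direction: {target_shift:.3f}")
print(f"side-effect shift along other direction: {spillover:.3f}")
```

In this toy setting the side-effect shift is nonzero only because `other_axis` overlaps with the steering direction. In a real model the relationship between behaviors is not known in advance, which is why spillover has to be measured empirically on each behavior.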
Link To Code: https://github.com/wang-research-lab/SteeringSafety
Primary Area: Social Aspects->Alignment
Keywords: steering, alignment, interpretability, safety, bias, hallucination
Submission Number: 18812