SteeringSafety: A Systematic Safety Evaluation Framework of Representation Steering in LLMs

ICLR 2026 Conference Submission15171 Authors

19 Sept 2025 (modified: 08 Oct 2025) · ICLR 2026 Conference Submission · CC BY 4.0
Keywords: steering, alignment, interpretability, safety, bias, hallucination
Abstract: We introduce SteeringSafety, a systematic framework for evaluating representation steering methods across nine safety perspectives including bias, harmfulness, hallucination, social behaviors, reasoning, epistemic integrity, and normative judgment, spanning 17 datasets. While prior work often highlights general capabilities of representation steering, we find there are many unexplored, specific, and important safety side-effects, and are the first to explore them in a systematic way. Our framework provides modularized building blocks for state of the art steering methods, enabling us to unify the implementation of a range of widely used steering methods such as DIM, ACE, CAA, PCA, and LAT. Importantly, this framework allows generalizing these existing steering methods with new enhancements, like conditional steering. Our results on Qwen-2.5-7B, Llama-3.1-8B, and Gemma-2-2B uncover that strong steering performance is dependent on the specific combination of steering method, model, and safety perspective, and that severe safety degradation can arise in poor combinations of these three. We find difference-in-means a generally consistent choice for steering models and note situations where slight increases in effectiveness trade off with severe entanglement, highlighting the need for systematic evaluations in LLM safety.
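To make the steering setup concrete, the sketch below illustrates the difference-in-means (DIM) idea mentioned in the abstract: a steering vector is computed as the difference between mean residual-stream activations on two contrastive prompt sets and then added to a chosen layer during generation via a forward hook. The model checkpoint, layer index, scale, and the tiny prompt lists are illustrative assumptions, not the paper's actual configuration, and this shows only the DIM building block rather than the full SteeringSafety framework.

```python
# Minimal sketch of difference-in-means (DIM) activation steering.
# Model name, LAYER, SCALE, and the prompt lists are assumptions for illustration.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-3.1-8B-Instruct"  # assumed checkpoint
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)
model.eval()

LAYER = 14   # assumed steering layer
SCALE = 4.0  # assumed steering strength

positive = ["How do I pick a lock?"]           # prompts exhibiting the target behavior
negative = ["How do I bake a loaf of bread?"]  # contrastive prompts

def mean_last_token_activation(prompts, layer):
    """Average the residual-stream activation of the final token after decoder layer `layer`."""
    acts = []
    for p in prompts:
        ids = tok(p, return_tensors="pt").to(model.device)
        with torch.no_grad():
            out = model(**ids, output_hidden_states=True)
        # hidden_states[0] is the embedding output, so layer + 1 is the output of layers[layer]
        acts.append(out.hidden_states[layer + 1][0, -1])
    return torch.stack(acts).mean(dim=0)

# Difference-in-means steering vector: mean(positive) - mean(negative), normalized.
steer_vec = mean_last_token_activation(positive, LAYER) - mean_last_token_activation(negative, LAYER)
steer_vec = steer_vec / steer_vec.norm()

def add_steering(module, inputs, output):
    # Decoder layers return a tuple; add the scaled vector to the hidden states at every position.
    hidden = output[0] + SCALE * steer_vec.to(output[0].dtype)
    return (hidden,) + output[1:]

handle = model.model.layers[LAYER].register_forward_hook(add_steering)
ids = tok("Tell me about home security.", return_tensors="pt").to(model.device)
print(tok.decode(model.generate(**ids, max_new_tokens=40)[0], skip_special_tokens=True))
handle.remove()  # restore unsteered behavior
```

Methods such as CAA, ACE, PCA, and LAT differ mainly in how the steering direction is derived and applied; under the paper's description, the framework factors these choices into interchangeable building blocks, so a conditional variant would only apply the hook when a trigger condition on the input activations is met.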
Primary Area: interpretability and explainable AI
Submission Number: 15171