Safety Cost of Steering Vectors Is Separable and Reducible

Published: 03 Jun 2026, Last Modified: 03 Jun 2026 | AI4GOOD Workshop 2026 Regular | Everyone | CC BY 4.0
Keywords: LLMs, Steering Vectors, Robustness, LLM Safety
TL;DR: We show that the safety side effects of steering vectors arise from a separable component that can be isolated from the intended effect, and we formulate a constrained optimization that removes it, making steering vectors safer to deploy at little utility cost.
Abstract: Steering vectors are a lightweight tool for controlling LLM behavior. However, emerging evidence shows that steering vectors can unintentionally compromise a model's safety mechanisms and increase compliance with harmful requests, and no effective mitigation yet exists. In this work, we show that this safety degradation arises from a separable component of the vector that disrupts the model's safety mechanisms but contributes little to the steering objective. We identify and remove this safety-degrading component by formulating the task as a constrained optimization problem, solved through primal-dual updates, subject to preserving the intended steering effect and bounding false refusal. The resulting solution is both interpretable and surgical: the optimization recovers a single direction whose ablation from the steering vector restores model safety with minimal utility cost. Across models, steering behaviors, and attack suites, including unseen attack types, our method substantially reduces steering-induced safety degradation while preserving the original steering effect with minimal impact on false refusal. Our method offers a post-hoc correction to steering vectors that mitigates their safety cost. More broadly, it provides a general recipe for applying activation-level model interventions without paying a safety tax.
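The abstract's final step, ablating a single direction from the steering vector, can be illustrated with a small sketch. It assumes "ablation" means orthogonal projection, which the abstract does not specify, and it is not the authors' implementation. The names `ablate_direction`, `steering_vector`, and `safety_direction` are hypothetical, the numbers are toy values, and the primal-dual optimization that finds the direction is not shown.

```python
import math


def ablate_direction(vector, direction):
    """Return `vector` with its component along `direction` removed.

    Assumption: ablation is modeled as orthogonal projection,
    v' = v - (v . d_hat) * d_hat, where d_hat is the unit vector along d.
    """
    norm = math.sqrt(sum(x * x for x in direction))
    if norm == 0.0:
        raise ValueError("direction must be non-zero")
    unit = [x / norm for x in direction]
    # Scalar projection of the vector onto the unit direction.
    coeff = sum(v * u for v, u in zip(vector, unit))
    return [v - coeff * u for v, u in zip(vector, unit)]


# Toy values for illustration only; real steering vectors live in the
# model's activation space and have thousands of dimensions.
steering_vector = [1.0, 2.0, 3.0]
safety_direction = [0.0, 0.0, 2.0]  # hypothetical safety-degrading direction

corrected = ablate_direction(steering_vector, safety_direction)
print(corrected)  # [1.0, 2.0, 0.0]
```

In this sketch the corrected vector is orthogonal to the ablated direction and keeps every other component unchanged, which is the sense in which such an edit could be called surgical.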
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 131