Steering Externalities: Benign Activation Steering Unintentionally Increases Jailbreak Risk for Large Language Models

Published: 11 Jun 2026, Last Modified: 11 Jun 2026, Mech Interp Workshop ICML 2026 Virtual poster, CC BY 4.0
Keywords: Interpretability for AI Safety, Applications of interpretability
Other Keywords: Large Language Model, Activation Steering, Jailbreak Attack
TL;DR: Benign activation steering improves deployment-time utility but can unintentionally erode LLM safety margins and amplify black-box jailbreak risk.
Abstract: Activation steering is a practical post-training alignment technique for enhancing the utility of Large Language Models (LLMs). Before deploying a model as a service, developers can steer a pre-trained model toward specific behavioral objectives, such as improved truthfulness or reasoning ability, without retraining. Conceptually, these methods control behavior through hidden-state interventions, leaving the underlying model parameters unchanged. However, this capability unintentionally introduces critical and under-explored safety risks. We identify a phenomenon termed **Steering Externalities**, in which steering vectors derived from benign datasets (for example, to reduce harmless refusals or to improve structured-output following, truthfulness, and reasoning performance) inadvertently erode safety guardrails. Experiments reveal that these interventions act as a force multiplier for jailbreaks: they create new vulnerabilities and raise attack success rates to over 80% on standard benchmarks by bypassing the initial safety alignment. Ultimately, our results expose a critical blind spot in deployment: benign activation steering can erode the "safety margin," rendering models more vulnerable to black-box attacks, and inference-time utility improvements must therefore be rigorously audited for unintended safety externalities.
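As a rough illustration of the hidden-state intervention described in the abstract, the sketch below adds a scaled steering vector to a toy hidden state while leaving any weights untouched. The difference-of-means construction, the function names, and the `alpha` scale are assumptions made for illustration; the abstract does not specify how the steering vectors are derived, and real systems operate on a model's internal activations rather than on small lists.

```python
# Illustrative sketch only: toy vectors stand in for a model's hidden states.
# The difference-of-means construction is an assumption, not the paper's stated method.

def mean_vector(rows):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def steering_vector(positive_acts, negative_acts):
    """Mean activation on 'desired' examples minus mean on 'undesired' ones."""
    pos = mean_vector(positive_acts)
    neg = mean_vector(negative_acts)
    return [p - q for p, q in zip(pos, neg)]


def steer(hidden, vector, alpha=1.0):
    """Add the scaled steering vector to a hidden state; no parameters change."""
    return [h + alpha * v for h, v in zip(hidden, vector)]


# Hypothetical activations collected from a benign dataset.
positive = [[1.0, 0.0, 2.0], [3.0, 0.0, 2.0]]
negative = [[0.0, 1.0, 2.0], [0.0, 1.0, 2.0]]

v = steering_vector(positive, negative)
steered = steer([0.5, 0.5, 0.5], v, alpha=2.0)
```

In this toy setting the intervention is a single vector addition at inference time, which is why a vector built for a benign objective can still shift activations along directions that matter for safety behavior.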
Submission Number: 345