Keywords: Steering, Representation Engineering, LLM, AI, Social Bias
Abstract: Steering (inference-time modification of activations) offers a lightweight alternative to fine-tuning for aligning large language models (LLMs). While steering is effective on targeted behaviors, its effects on unrelated model behaviors are not yet understood. Here, we present a systematic comparison of steering across pretrained and fine-tuned models in the context of social bias. We find that in pretrained models, steering suppresses the intended (stereotypical) behavior, as expected. In fine-tuned models, however, steering primarily suppresses unrelated outputs, which is both unexpected and undesired. This misalignment shows that aggregate metrics mask side effects, highlighting the need to focus on intervention fidelity (the degree to which an intervention affects models as intended). We hypothesize that fine-tuning increases the anisotropy of the latent space, entangling unrelated behaviors and thereby reducing steering precision.
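For illustration only, below is a minimal sketch of the kind of activation steering the abstract refers to, not the authors' implementation: a fixed vector is added to one transformer layer's hidden states at inference time via a forward hook. The model name, layer index, steering coefficient, and random steering vector are all assumptions; in practice the vector would be derived from contrastive activations.

```python
# Illustrative sketch of activation steering via a forward hook (assumed setup,
# not the paper's method). Requires: torch, transformers.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # hypothetical choice; any causal LM with accessible blocks works
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).eval()

layer_idx = 6   # assumed intervention layer
alpha = 5.0     # assumed steering coefficient
hidden_size = model.config.hidden_size

# A real steering vector is typically a difference of mean activations between
# contrastive prompts; a random unit vector is used here purely for illustration.
steer_vec = torch.randn(hidden_size)
steer_vec = steer_vec / steer_vec.norm()

def steering_hook(module, inputs, output):
    # GPT-2 blocks return a tuple whose first element is the hidden states;
    # add the steering vector to every token position at this layer.
    hidden = output[0] + alpha * steer_vec.to(output[0].dtype)
    return (hidden,) + output[1:]

handle = model.transformer.h[layer_idx].register_forward_hook(steering_hook)
try:
    ids = tok("The nurse said that", return_tensors="pt")
    out = model.generate(**ids, max_new_tokens=20, do_sample=False)
    print(tok.decode(out[0], skip_special_tokens=True))
finally:
    handle.remove()  # detach the hook to restore the unsteered model
```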
Submission Number: 81