Keywords: steering, interpretability, localization, causal mediation analysis, control
TL;DR: We show that steering models through contrastive signals from long-form text can lead to effective model control.
Abstract: Where should we intervene in a language model (LM) to control behaviors that are diffuse across numerous tokens? To answer this question, we introduce Contrastive Causal Mediation (CCM), a procedure for selecting steerable model components from long-form responses. In CCM, we construct a dataset of contrasting inputs and LM responses that define a goal for the intervention, such as generating text in verse instead of prose. We then quantify how model components mediate the effect of the contrastive input signal on producing the contrasting LM responses and select the strongest mediators for steering. We evaluate CCM across three tasks—refusal, sycophancy, and style transfer—and three models, and find that it consistently outperforms correlational baselines that use probes to select attention heads for steering. Moreover, a lightweight CCM variant using a gradient approximation technique achieves equivalent performance. Finally, we show that while steering all attention heads succeeds on held-in test data, only steering a localized set of attention heads produces effects that generalize to held-out test datasets. Together, these results demonstrate how causally grounded mechanistic interpretability can enable effective control of LMs generating long-form text.
Primary Area: interpretability and explainable AI
Submission Number: 17187
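To make the selection step described in the abstract concrete, here is a minimal sketch of how one might rank attention heads by their mediation of a contrastive signal and keep the strongest mediators for steering. This is an illustration under stated assumptions, not the paper's implementation: `run_and_score`, `cache_head_outputs`, and the patching interface are hypothetical stand-ins for model-specific code.

```python
# Hypothetical helpers (not from the paper):
#   cache_head_outputs(prompt) -> {head_id: activation} from one forward pass.
#   run_and_score(prompt, patch=None) -> scalar measuring how strongly the
#     response exhibits the target behavior; `patch` optionally overwrites
#     one attention head's output with a cached activation during the run.

def ccm_select_heads(base_prompt, contrast_prompt,
                     run_and_score, cache_head_outputs, k=16):
    """Rank attention heads by how much patching in the contrastive
    activation moves the behavior score, and keep the top-k mediators."""
    contrast_acts = cache_head_outputs(contrast_prompt)
    base_score = run_and_score(base_prompt)

    effects = {}
    for head_id, act in contrast_acts.items():
        # Indirect effect of this head: rerun the base prompt, but with the
        # head's output swapped for its value under the contrastive prompt.
        patched_score = run_and_score(base_prompt, patch={head_id: act})
        effects[head_id] = patched_score - base_score

    # The strongest mediators are the heads whose patching most shifts
    # the response toward the contrastive behavior.
    return sorted(effects, key=effects.get, reverse=True)[:k]
```

Steering would then add a scaled contrastive direction at only the selected heads during generation; the lightweight variant mentioned in the abstract would replace the per-head patching loop with a gradient-based approximation of the same effect scores.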