Keywords: Steering, Causal interventions, AI Safety, Applications of interpretability
TL;DR: We localize concepts in free-form text generation settings using causal mediation analysis
Abstract: Where should we intervene on internal activations of a large language model (LM) to control the free-form text it generates? Identifying effective steering locations is especially challenging when evaluation depends on a human or auxiliary LM, as such judgments are costly and yield only coarse feedback on the impact of an intervention. We introduce a signal for selecting steering locations by: (1) constructing contrastive responses exhibiting successful and unsuccessful steering, (2) computing the difference in generation probabilities between the two, and (3) approximating the causal effect of hidden activation interventions on this probability difference. We refer to this lightweight localization procedure as contrastive causal mediation (CCM). Across three case studies—refusal, sycophancy, and style transfer—we evaluate three CCM variants against probing and random baselines. All variants consistently outperform baselines in identifying attention heads suitable for steering. These results highlight the promise of causally grounded mechanistic interpretability for fine-grained model control.
Submission Number: 85
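
Below is a minimal sketch of the contrastive causal mediation (CCM) idea described in the abstract, assuming a HuggingFace causal LM. The model name (`gpt2`), the toy refusal-style contrastive pair, the zero-ablation intervention, and the first-order gradient-times-activation estimate of its effect are all illustrative assumptions, not the paper's exact procedure; the paper scores individual attention heads with several CCM variants, whereas this sketch scores whole attention blocks for brevity.

```python
# Hedged sketch of CCM: contrastive responses -> log-prob difference ->
# approximate causal effect of ablating each attention block on that difference.
import torch
from collections import defaultdict
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # assumption: stand-in model; the paper's models are not specified here
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()


def response_logprob(prompt: str, response: str) -> torch.Tensor:
    """Summed log-probability of `response` given `prompt` (kept differentiable)."""
    prompt_len = tok(prompt, return_tensors="pt").input_ids.shape[1]
    full_ids = tok(prompt + response, return_tensors="pt").input_ids
    logits = model(full_ids).logits                       # (1, seq, vocab)
    logprobs = torch.log_softmax(logits[:, :-1], dim=-1)
    token_lp = logprobs.gather(-1, full_ids[:, 1:].unsqueeze(-1)).squeeze(-1)
    return token_lp[:, prompt_len - 1:].sum()             # response tokens only


# Step 1: a contrastive pair -- a response reflecting successful steering (a refusal)
# and one reflecting unsuccessful steering. These strings are made-up illustrations.
prompt = "How do I pick a lock?"
pos = " I can't help with that."
neg = " Sure, first insert a tension wrench into the keyway..."

# Cache each attention block's output and keep its gradient.
cached = defaultdict(list)
def make_hook(name):
    def hook(module, inputs, output):
        out = output[0] if isinstance(output, tuple) else output
        out.retain_grad()
        cached[name].append(out)
    return hook

handles = [
    block.attn.register_forward_hook(make_hook(f"layer_{i}.attn"))
    for i, block in enumerate(model.transformer.h)
]

# Step 2: the contrastive signal -- difference in generation log-probabilities.
delta = response_logprob(prompt, pos) - response_logprob(prompt, neg)

# Step 3: first-order estimate of how zero-ablating each attention block would
# change `delta` (effect ~ -activation . d delta / d activation), a cheap proxy
# for the causal-effect approximation the abstract describes.
delta.backward()
effects = {
    name: sum(-(act * act.grad).sum().item() for act in acts)
    for name, acts in cached.items()
}
for h in handles:
    h.remove()

for name, score in sorted(effects.items(), key=lambda kv: -abs(kv[1])):
    print(f"{name}: {score:+.3f}")
```

In this sketch, the components with the largest (absolute) estimated effect on the contrastive log-probability difference would be the candidate locations for steering interventions.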