Keywords: Steering, Causal interventions, AI Safety, Applications of interpretability
TL;DR: We localize concepts in free-form text generation settings using causal mediation analysis
Abstract: Where should we intervene on internal activations of a large language model (LM) to control the free-form text it generates? Identifying effective steering locations is especially challenging when evaluation depends on a human or auxiliary LM, as such judgments are costly and yield only coarse feedback on the impact of an intervention. We introduce a signal for selecting steering locations by: (1) constructing contrastive responses exhibiting successful and unsuccessful steering, (2) computing the difference in generation probabilities between the two, and (3) approximating the causal effect of hidden activation interventions on this probability difference. We refer to this lightweight localization procedure as contrastive causal mediation (CCM). Across three case studies—refusal, sycophancy, and style transfer—we evaluate three CCM variants against probing and random baselines. All variants consistently outperform baselines in identifying attention heads suitable for steering. These results highlight the promise of causally grounded mechanistic interpretability for fine-grained model control.
Submission Number: 85
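
Below is a minimal sketch of the contrastive causal mediation (CCM) idea described in the abstract, assuming a HuggingFace causal LM. The model name (`gpt2`), the toy refusal-style contrastive pair, the zero-ablation intervention, and the first-order gradient-times-activation estimate of its effect are all illustrative assumptions, not the paper's exact procedure; the paper scores individual attention heads with several CCM variants, whereas this sketch scores whole attention blocks for brevity.

```python
# Hedged sketch of CCM: contrastive responses -> log-prob difference ->
# approximate causal effect of ablating each attention block on that difference.
import torch
from collections import defaultdict
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # assumption: stand-in model; the paper's models are not specified here
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()


def response_logprob(prompt: str, response: str) -> torch.Tensor:
    """Summed log-probability of `response` given `prompt` (kept differentiable)."""
    prompt_len = tok(prompt, return_tensors="pt").input_ids.shape[1]
    full_ids = tok(prompt + response, return_tensors="pt").input_ids
    logits = model(full_ids).logits                       # (1, seq, vocab)
    logprobs = torch.log_softmax(logits[:, :-1], dim=-1)
    token_lp = logprobs.gather(-1, full_ids[:, 1:].unsqueeze(-1)).squeeze(-1)
    return token_lp[:, prompt_len - 1:].sum()             # response tokens only


# Step 1: a contrastive pair -- a response reflecting successful steering (a refusal)
# and one reflecting unsuccessful steering. These strings are made-up illustrations.
prompt = "How do I pick a lock?"
pos = " I can't help with that."
neg = " Sure, first insert a tension wrench into the keyway..."

# Cache each attention block's output and keep its gradient.
cached = defaultdict(list)
def make_hook(name):
    def hook(module, inputs, output):
        out = output[0] if isinstance(output, tuple) else output
        out.retain_grad()
        cached[name].append(out)
    return hook

handles = [
    block.attn.register_forward_hook(make_hook(f"layer_{i}.attn"))
    for i, block in enumerate(model.transformer.h)
]

# Step 2: the contrastive signal -- difference in generation log-probabilities.
delta = response_logprob(prompt, pos) - response_logprob(prompt, neg)

# Step 3: first-order estimate of how zero-ablating each attention block would
# change `delta` (effect ~ -activation . d delta / d activation), a cheap proxy
# for the causal-effect approximation the abstract describes.
delta.backward()
effects = {
    name: sum(-(act * act.grad).sum().item() for act in acts)
    for name, acts in cached.items()
}
for h in handles:
    h.remove()

for name, score in sorted(effects.items(), key=lambda kv: -abs(kv[1])):
    print(f"{name}: {score:+.3f}")
```

In this sketch, the components with the largest (absolute) estimated effect on the contrastive log-probability difference would be the candidate locations for steering interventions.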