Keywords: interpretability, activation steering, sparse autoencoders, self-correction, mechanistic interpretability, introspection
TL;DR: Llama-3.3-70B can spontaneously detect and recover from off-topic activation steering mid-generation, and we identify causal "off-topic detector" circuits whose ablation reduces this self-correction behavior.
Abstract: Large language models can resist task-misaligned activation steering during inference, recovering mid-generation to produce improved responses even when steering remains active. We term this Endogenous Steering Resistance (ESR). Using sparse autoencoder (SAE) latents to steer Llama-3.3-70B and smaller models, we find ESR occurs substantially more in Llama-3.3-70B (3.8% vs. <1% for others). We identify 26 SAE latents causally linked to ESR: ablating them reduces ESR by 25%. Meta-prompts enhance ESR 4×, demonstrating controllability. These findings have dual safety implications: ESR could protect against adversarial manipulation but might interfere with beneficial activation-based safety interventions.
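The steering and ablation operations described in the abstract can be illustrated with a minimal sketch. This is not the paper's actual pipeline: the model, layer, latent indices, and coefficient below are hypothetical toy values, and a real intervention would hook the residual stream of Llama-3.3-70B rather than operate on a random vector. It only shows the two vector operations involved: adding a scaled SAE decoder direction (steering) and projecting that direction out (ablation).

```python
import numpy as np

rng = np.random.default_rng(0)

d_model = 16                                 # toy residual-stream width
h = rng.normal(size=d_model)                 # a stand-in activation vector
decoder_dir = rng.normal(size=d_model)       # stand-in SAE decoder direction
decoder_dir /= np.linalg.norm(decoder_dir)   # normalize to a unit direction

alpha = 4.0                                  # hypothetical steering coefficient

def steer(activation, direction, coeff):
    """Steering: add a scaled (unit) SAE decoder direction to the activation."""
    return activation + coeff * direction

def ablate(activation, direction):
    """Ablation: remove the activation's component along the (unit) direction."""
    return activation - (activation @ direction) * direction

h_steered = steer(h, decoder_dir, alpha)
h_ablated = ablate(h, decoder_dir)

# Steering shifts the activation by alpha along the direction;
# ablation leaves (approximately) zero component along it.
print(round(float((h_steered - h) @ decoder_dir), 6))   # → 4.0
print(round(abs(float(h_ablated @ decoder_dir)), 6))    # → 0.0
```

In practice these edits would be applied at every token position of a chosen layer during generation, with the 26 ESR-linked latents ablated via the projection above.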
Anonymization: This submission has been anonymized for double-blind review via the removal of identifying information such as names, affiliations, and identifying URLs.
Style Files: I have used the style files.
Submission Number: 31