Endogenous Resistance to Activation Steering in Language Models

Published: 02 Mar 2026, Last Modified: 07 Mar 2026 · ICLR 2026 Trustworthy AI · CC BY 4.0
Keywords: interpretability, activation steering, sparse autoencoders, self-correction, mechanistic interpretability, introspection
TL;DR: Llama-3.3-70B can spontaneously detect and recover from off-topic activation steering mid-generation, and we identify causal "off-topic detector" circuits whose ablation reduces this self-correction behavior.
Abstract: Activation steering is increasingly used for AI safety interventions, but its reliability depends on whether models can resist such steering. We find that large language models can spontaneously resist task-misaligned activation steering during inference, recovering mid-generation even while steering remains active. We term this Endogenous Steering Resistance (ESR). When we use sparse autoencoder (SAE) latents to steer Llama-3.3-70B and smaller models, ESR occurs substantially more often in Llama-3.3-70B (3.8\% vs.\ $<$1\% for the others). We identify 26 SAE latents causally linked to ESR: ablating them reduces ESR by 25\%. Meta-prompts enhance ESR 4$\times$, demonstrating controllability. These findings have dual implications for trustworthy AI: ESR could protect against adversarial manipulation, but it might also undermine beneficial activation-based safety interventions.
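To make the intervention concrete, here is a minimal sketch of SAE-latent activation steering of the kind the abstract describes: a residual-stream activation is shifted along an SAE latent's decoder direction. The function name, vector shapes, and steering coefficient are illustrative assumptions, not the paper's actual implementation or hyperparameters.

```python
import numpy as np

def steer_activation(h, decoder_dir, alpha=5.0):
    """Shift an activation along a (normalized) SAE decoder direction.

    h           -- residual-stream activation, shape (d_model,)
    decoder_dir -- the chosen SAE latent's decoder direction, shape (d_model,)
    alpha       -- steering strength (illustrative value, not from the paper)
    """
    d = decoder_dir / np.linalg.norm(decoder_dir)
    return h + alpha * d

# Toy demonstration with random vectors standing in for real activations.
rng = np.random.default_rng(0)
h = rng.normal(size=16)
d = rng.normal(size=16)
h_steered = steer_activation(h, d, alpha=5.0)

# The applied change is exactly alpha times the unit decoder direction.
delta = h_steered - h
assert np.allclose(delta, 5.0 * d / np.linalg.norm(d))
```

In practice this shift would be applied at a chosen layer on every generation step (e.g. via a forward hook), which is what makes mid-generation recovery while steering remains active a nontrivial behavior.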
Submission Number: 196