Endogenous Resistance to Activation Steering in Language Models

Published: 02 Mar 2026, Last Modified: 07 Mar 2026 · ICLR 2026 Trustworthy AI · CC BY 4.0
Keywords: interpretability, activation steering, sparse autoencoders, self-correction, mechanistic interpretability, introspection
TL;DR: Llama-3.3-70B can spontaneously detect and recover from off-topic activation steering mid-generation, and we identify causal "off-topic detector" circuits whose ablation reduces this self-correction behavior.
Abstract: Activation steering is increasingly used for AI safety interventions, but its reliability depends on whether models can resist such steering. We find that large language models can spontaneously resist task-misaligned activation steering during inference, recovering mid-generation even while steering remains active. We term this Endogenous Steering Resistance (ESR). When we use sparse autoencoder (SAE) latents to steer Llama-3.3-70B and smaller models, ESR occurs substantially more often in Llama-3.3-70B (3.8\% vs.\ $<$1\% for the others). We identify 26 SAE latents causally linked to ESR: ablating them reduces ESR by 25\%. Meta-prompts enhance ESR 4$\times$, demonstrating controllability. These findings have dual implications for trustworthy AI: ESR could protect against adversarial manipulation, but it might also undermine beneficial activation-based safety interventions.
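To make the intervention concrete, here is a minimal sketch of SAE-latent activation steering of the kind the abstract describes: a residual-stream activation is shifted along an SAE latent's decoder direction. The function name, vector shapes, and steering coefficient are illustrative assumptions, not the paper's actual implementation or hyperparameters.

```python
import numpy as np

def steer_activation(h, decoder_dir, alpha=5.0):
    """Shift an activation along a (normalized) SAE decoder direction.

    h           -- residual-stream activation, shape (d_model,)
    decoder_dir -- the chosen SAE latent's decoder direction, shape (d_model,)
    alpha       -- steering strength (illustrative value, not from the paper)
    """
    d = decoder_dir / np.linalg.norm(decoder_dir)
    return h + alpha * d

# Toy demonstration with random vectors standing in for real activations.
rng = np.random.default_rng(0)
h = rng.normal(size=16)
d = rng.normal(size=16)
h_steered = steer_activation(h, d, alpha=5.0)

# The applied change is exactly alpha times the unit decoder direction.
delta = h_steered - h
assert np.allclose(delta, 5.0 * d / np.linalg.norm(d))
```

In practice this shift would be applied at a chosen layer on every generation step (e.g. via a forward hook), which is what makes mid-generation recovery while steering remains active a nontrivial behavior.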
Submission Number: 196