Keywords: interpretability, activation steering, sparse autoencoders, self-correction, mechanistic interpretability, introspection
TL;DR: Llama-3.3-70B can spontaneously detect and recover from off-topic activation steering mid-generation, and we identify causal "off-topic detector" circuits whose ablation reduces this self-correction behavior.
Abstract: Large language models can resist task-misaligned activation steering during inference, recovering mid-generation to produce improved responses even when steering remains active. We term this Endogenous Steering Resistance (ESR). Using sparse autoencoder (SAE) latents to steer Llama-3.3-70B and smaller models, we find ESR occurs substantially more in Llama-3.3-70B (3.8% vs. <1% for others). We identify 26 SAE latents causally linked to ESR: ablating them reduces ESR by 25%. Meta-prompts enhance ESR 4×, demonstrating controllability. These findings have dual safety implications: ESR could protect against adversarial manipulation but might interfere with beneficial activation-based safety interventions.
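The steering and ablation operations described in the abstract can be illustrated with a minimal sketch. This is not the paper's actual pipeline: the model, layer, latent indices, and coefficient below are hypothetical toy values, and a real intervention would hook the residual stream of Llama-3.3-70B rather than operate on a random vector. It only shows the two vector operations involved: adding a scaled SAE decoder direction (steering) and projecting that direction out (ablation).

```python
import numpy as np

rng = np.random.default_rng(0)

d_model = 16                                 # toy residual-stream width
h = rng.normal(size=d_model)                 # a stand-in activation vector
decoder_dir = rng.normal(size=d_model)       # stand-in SAE decoder direction
decoder_dir /= np.linalg.norm(decoder_dir)   # normalize to a unit direction

alpha = 4.0                                  # hypothetical steering coefficient

def steer(activation, direction, coeff):
    """Steering: add a scaled (unit) SAE decoder direction to the activation."""
    return activation + coeff * direction

def ablate(activation, direction):
    """Ablation: remove the activation's component along the (unit) direction."""
    return activation - (activation @ direction) * direction

h_steered = steer(h, decoder_dir, alpha)
h_ablated = ablate(h, decoder_dir)

# Steering shifts the activation by alpha along the direction;
# ablation leaves (approximately) zero component along it.
print(round(float((h_steered - h) @ decoder_dir), 6))   # → 4.0
print(round(abs(float(h_ablated @ decoder_dir)), 6))    # → 0.0
```

In practice these edits would be applied at every token position of a chosen layer during generation, with the 26 ESR-linked latents ablated via the projection above.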
Anonymization: This submission has been anonymized for double-blind review via the removal of identifying information such as names, affiliations, and identifying URLs.
Style Files: I have used the style files.
Submission Number: 31