Intrinsic Self-Correction in LLMs: Towards Explainable Prompting via Mechanistic Interpretability

Published: 11 Nov 2025, Last Modified: 16 Jan 2026 · DAI Poster · CC BY 4.0
Keywords: self-correction, interpretability, linear representations, large language models
Abstract: Intrinsic self-correction refers to the phenomenon where a language model refines its own outputs purely through prompting, without external feedback or parameter updates. While this approach improves performance across diverse tasks, its internal mechanism remains poorly understood. We analyze intrinsic self-correction through the representation shifts induced by prompting, formalizing the notion of a prompt-induced shift: the change in hidden representations caused by a self-correction prompt. Across five open-source LLMs, prompt-induced shifts in text detoxification and text toxification align with latent directions constructed from contrastive pairs: in detoxification, the shifts align with the non-toxic direction; in toxification, with the toxic direction. These results suggest that intrinsic self-correction functions as representation steering along interpretable latent directions, and that an understanding of model internals can be a direct route to analyzing the mechanisms of prompt-driven LLM behaviors.
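The core measurement the abstract describes can be sketched in a few lines: compute a latent direction from contrastive pairs of hidden states, compute the prompt-induced shift as the difference between hidden states with and without the self-correction prompt, and measure their cosine alignment. This is a minimal illustration with synthetic vectors standing in for model hidden states; the variable names and the synthetic construction are assumptions, not the paper's actual pipeline.

```python
import numpy as np

def cosine(u: np.ndarray, v: np.ndarray) -> float:
    """Cosine similarity between two vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

rng = np.random.default_rng(0)
d = 64  # hypothetical hidden-state dimension

# Latent "non-toxic" direction from contrastive pairs: mean hidden state
# of non-toxic examples minus mean of toxic ones (synthetic stand-ins here).
h_nontoxic = rng.normal(size=(10, d)) + 2.0
h_toxic = rng.normal(size=(10, d))
direction = h_nontoxic.mean(axis=0) - h_toxic.mean(axis=0)

# Prompt-induced shift: hidden state of the input with the self-correction
# prompt minus the hidden state of the same input without it. Here the
# "prompted" state is synthesized to move partly along the latent direction.
h_base = rng.normal(size=d)
h_prompted = h_base + 0.5 * direction + 0.1 * rng.normal(size=d)
shift = h_prompted - h_base

# High alignment would indicate the prompt steers representations
# along the interpretable (non-toxic) direction.
alignment = cosine(shift, direction)
print(f"alignment = {alignment:.2f}")
```

In the paper's setting, `h_base` and `h_prompted` would come from a model's hidden states at some layer, and the contrastive sets from real toxic/non-toxic text pairs; the alignment score then quantifies how much the self-correction prompt acts as representation steering.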
Submission Number: 5