Keywords: introspection, mechanistic interpretability, steering vectors, large language models, metacognition
TL;DR: We study how language models detect injected internal perturbations, showing that detection relies on distributed anomaly-detection mechanisms that are elicited by post-training and can be substantially amplified by targeted interventions.
Abstract: Human metacognition converts internal cues into explicit reports (e.g., "something feels off"), supporting error correction and communication. We ask whether LLMs exhibit an analogous capacity and what mechanisms underlie it. In a controlled "thought injection" setting, we add concept steering vectors to the residual stream and ask models to detect the perturbation and identify its content. We find that (i) detection is behaviorally robust (moderate true positive rates, 0\% false positives) across diverse prompts and is strongest under post-trained assistant personas, indicating that introspection is largely elicited by post-training rather than pretraining; (ii) detection is not reducible to a single linear confound, relies on distributed mid-to-late-layer MLP computation, and involves identifiable gate and evidence-carrier features; (iii) identification depends on partially distinct circuitry; and (iv) introspection is under-elicited by default: ablating refusal directions improves detection by $\sim$50\%, and a trained steering vector improves it by $\sim$75\%. In this testable setting for metacognitive-style monitoring, our results reveal causally identifiable mechanisms and provide a target for interpretable and human-aligned AI reasoning.
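The two interventions named in the abstract, adding a concept steering vector to the residual stream and ablating a refusal direction, can be illustrated on a single activation vector. The following is a minimal sketch, not the authors' implementation: the function names `inject` and `ablate_direction`, the scale `alpha`, and the toy 4-dimensional vectors are all assumptions made for illustration, and a real experiment would apply these operations to a transformer's hidden states at chosen layers and token positions.

```python
import math

def inject(residual, concept, alpha):
    """Add a scaled concept vector to one residual-stream activation.

    Hypothetical helper: `alpha` is an assumed injection strength.
    """
    return [h + alpha * v for h, v in zip(residual, concept)]

def ablate_direction(residual, direction):
    """Remove the component of the activation that lies along `direction`.

    Hypothetical helper: projects the activation onto the normalized
    direction and subtracts that projection.
    """
    norm = math.sqrt(sum(d * d for d in direction))
    unit = [d / norm for d in direction]
    proj = sum(h * u for h, u in zip(residual, unit))
    return [h - proj * u for h, u in zip(residual, unit)]

# Toy 4-dimensional activation and directions (illustrative values only).
h = [1.0, 2.0, 0.5, 3.0]
concept = [0.0, 1.0, 0.0, 0.0]   # stand-in for a concept steering vector
refusal = [1.0, 0.0, 0.0, 1.0]   # stand-in for a refusal direction

steered = inject(h, concept, alpha=4.0)
ablated = ablate_direction(h, refusal)

# After ablation the activation has no remaining component along `refusal`.
residual_component = sum(a * r for a, r in zip(ablated, refusal))
```

In this sketch, injection shifts the activation along the concept direction, while ablation leaves every component orthogonal to the refusal direction unchanged.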
Paper Type: New Full Paper
Submission Number: 67