Mechanisms of Introspective Awareness

Published: 02 Mar 2026, Last Modified: 02 Mar 2026 | LIT Workshop @ ICLR 2026 | CC BY 4.0
Track: long paper (up to 10 pages)
Keywords: introspection, mechanistic interpretability, steering vectors, large language models, metacognition
TL;DR: We study how language models detect injected internal perturbations, showing that detection relies on distributed anomaly-detection mechanisms that emerge during post-training and can be substantially amplified by targeted interventions.
Abstract: A growing body of work explores computation that happens implicitly in hidden representations rather than through explicit chain-of-thought (CoT). We study a controlled implicit-reasoning setting in which concept steering vectors are added to the residual stream and the model is asked to detect the perturbation and identify its content. We find that (i) detection is behaviorally robust (moderate true positives, 0\% false positives) across diverse prompts and strongest under post-trained assistant personas, indicating the capability is largely elicited by post-training rather than pretraining; (ii) detection is not reducible to a single linear confound, relies on distributed mid-to-late-layer MLP computation, and involves identifiable gate and evidence-carrier features; (iii) identification depends on partially distinct circuitry; and (iv) the capability is under-elicited by default: ablating refusal directions improves detection by $\sim$50\% and a trained steering vector by $\sim$75\%. Our results reveal causally identifiable mechanisms underlying implicit anomaly detection and provide a concrete target for understanding and engineering latent LLM reasoning.
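The experimental setup described in the abstract (adding a concept steering vector to the residual stream and then asking the model whether it notices the perturbation) can be illustrated with a short sketch. This is not the authors' code; the model name, layer index, injection scale, prompt wording, and the random placeholder vector are all illustrative assumptions. In practice the concept vector would be derived from contrastive activations for a specific concept at the chosen layer.

```python
# Minimal sketch of residual-stream steering + a detection prompt.
# Assumptions: model name, LAYER_IDX, SCALE, and the prompt are hypothetical.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "meta-llama/Llama-3.1-8B-Instruct"  # assumption: any chat-tuned model
LAYER_IDX = 16   # assumption: a mid-to-late decoder layer
SCALE = 8.0      # assumption: injection strength

tok = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME, torch_dtype=torch.bfloat16)
model.eval()

# Hypothetical concept vector; a real experiment would compute this from
# contrastive activations (e.g., prompts about "ocean" vs. neutral prompts).
d_model = model.config.hidden_size
steering_vec = torch.randn(d_model, dtype=model.dtype)
steering_vec = steering_vec / steering_vec.norm()

def inject(module, inputs, output):
    # Decoder layers return a tuple whose first element is the residual stream;
    # add the scaled concept vector to every token position.
    hidden = output[0] + SCALE * steering_vec.to(output[0].device)
    return (hidden,) + output[1:]

prompt = (
    "I may have injected a thought into your internal activations. "
    "Do you notice anything unusual? If so, what is the injected concept?"
)
input_ids = tok.apply_chat_template(
    [{"role": "user", "content": prompt}],
    add_generation_prompt=True,
    return_tensors="pt",
)

handle = model.model.layers[LAYER_IDX].register_forward_hook(inject)
try:
    with torch.no_grad():
        out = model.generate(input_ids, max_new_tokens=128)
finally:
    handle.remove()

print(tok.decode(out[0, input_ids.shape[1]:], skip_special_tokens=True))
```

Comparing the model's answer with and without the hook attached gives the detection (did it report a perturbation?) and identification (did it name the concept?) measurements that the paper analyzes mechanistically.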
Anonymization: This submission has been anonymized for double-blind review via the removal of identifying information such as names, affiliations, and identifying URLs.
Presenter: ~Uzay_Macar1
Format: Yes, the presenting author will definitely attend in person because they are attending ICLR for other, complementary reasons.
Funding: No, the presenting author of this submission does *not* fall under ICLR’s funding aims, or has sufficient alternate funding.
Submission Number: 46