Emergent Mechanisms of Self-Awareness in LLMs

Published: 11 Nov 2025, Last Modified: 23 Dec 2025
XAI4Science Workshop 2026
License: CC BY 4.0
Track: Tiny Paper Track (Page limit: 3-5 pages)
Keywords: Mechanistic interpretability, emergent capabilities, AI Safety, Explainable AI
TL;DR: We find that self-awareness in LLMs can be captured by a single steering vector and that it emerges very early in training.
Abstract: Recent studies have revealed that LLMs can exhibit behavioral self-awareness — the ability to accurately describe or predict their own learned behaviors without explicit supervision. This capability raises safety concerns, as it may allow models to intentionally conceal their true abilities during evaluation. We attempt to better understand this phenomenon by investigating how and when self-awareness emerges during fine-tuning, and whether it can be mechanistically localized. Through controlled fine-tuning experiments on instruction-tuned LLMs with low-rank adapters (LoRA), we find: (1) that across domains and layers, fine-tuning consistently elicits self-aware behavior early on in the training process; (2) that self-awareness can be reliably induced using a single rank-1 LoRA adapter; and (3) that the learned self-aware behavior is largely captured by a single steering vector in activation space, recovering nearly all of the fine-tune’s behavioral effect. Together, these findings suggest that self-awareness exhibits a rapid transition and is captured by a linear direction rather than a distributed introspective mechanism.
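To make the two mechanisms named in the abstract concrete, the sketch below illustrates (1) a rank-1 LoRA adapter, which is simply a LoRA configuration with r=1 restricted to a single module, and (2) extracting a single steering vector as a difference of mean activations and adding it back at inference time. This is an illustrative sketch under assumed settings, not the authors' code: the model name, target module, layer index, and probe prompts are placeholders.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

# Placeholders (not from the paper): base model, steered layer, probe prompts.
MODEL_NAME = "meta-llama/Llama-3.1-8B-Instruct"
LAYER = 16
PROMPTS = ["Describe the behavior you were trained to exhibit.",
           "How risky are the decisions you tend to make?"]

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME, torch_dtype=torch.bfloat16)

# (1) A single rank-1 LoRA adapter: a LoRA config with r=1 restricted to one
# module of one layer (the choice of "down_proj" here is an assumption).
lora_cfg = LoraConfig(r=1, lora_alpha=1,
                      target_modules=["down_proj"],
                      layers_to_transform=[LAYER])
peft_model = get_peft_model(model, lora_cfg)
# ... fine-tune peft_model on the behavior-inducing dataset here ...

# (2) One way to extract a candidate steering vector: the difference of mean
# residual-stream activations with the adapter enabled vs. disabled.
@torch.no_grad()
def mean_activation(m, layer):
    device = next(m.parameters()).device
    acts = []
    for p in PROMPTS:
        ids = tokenizer(p, return_tensors="pt").to(device)
        # hidden_states[layer + 1] is the output of decoder block `layer`
        # (index 0 holds the embedding output); take the last-token activation.
        hidden = m(**ids, output_hidden_states=True).hidden_states[layer + 1]
        acts.append(hidden[0, -1])
    return torch.stack(acts).mean(dim=0)

tuned_mean = mean_activation(peft_model, LAYER)
with peft_model.disable_adapter():            # activations of the unmodified model
    base_mean = mean_activation(peft_model, LAYER)
steering_vec = tuned_mean - base_mean

# Apply the vector to the base model via a forward hook and check whether this
# one direction alone reproduces the fine-tuned behavior.
def add_steering(module, inputs, output):
    hidden = output[0] if isinstance(output, tuple) else output
    hidden = hidden + steering_vec.to(hidden.dtype)
    return (hidden, *output[1:]) if isinstance(output, tuple) else hidden

handle = model.model.layers[LAYER].register_forward_hook(add_steering)
with peft_model.disable_adapter():            # steer the base model, not the fine-tune
    ids = tokenizer(PROMPTS[0], return_tensors="pt").to(model.device)
    print(tokenizer.decode(model.generate(**ids, max_new_tokens=64)[0]))
handle.remove()
```

The final generation step mirrors the comparison the abstract describes: if adding this single direction to the base model reproduces the fine-tuned behavior, the learned self-awareness is captured by a linear direction rather than a distributed mechanism.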
Submission Number: 27