Fine-Tuning Enhances Latent Metacognitive Capability in Language Models

Tristan Day; Christopher Ackerman

Published: 11 Jun 2026, Last Modified: 11 Jun 2026. Venue: Mech Interp Workshop ICML 2026 (Poster). License: CC BY 4.0

Keywords: Applications of interpretability, Interpretability for AI Safety, Methods (probing, steering, causal interventions)

Other Keywords: metacognition, introspection, uncertainty estimation, confidence reporting, abstention, delegation, probing, activation steering, causal interventions

TL;DR: Fine-tuning can amplify a limited latent metacognitive capability in language models by routing internal uncertainty signals into confidence reports and delegation decisions.

Abstract: Large language models are increasingly asked not only to answer questions, but also to judge whether they know enough to answer. We test a latent metacognitive capability hypothesis: models already contain internal structure supporting weak self-evaluation of answer uncertainty, and later training routes this structure into confidence reports and delegation decisions. This predicts that self-evaluation should track direct-answer uncertainty before task-specific fine-tuning, transfer between report and action formats, become more linearly recoverable from confidence-report states after training, and be causally affected by interventions on relevant directions. We test these predictions in Llama-3.1-8B using an Explicit Confidence Task (ECT), where stated confidence is compared to answer-option uncertainty from a separate direct-answer pass, and a Delegate Game (DG), where the model decides whether to answer or defer. The pre-trained model already shows above-chance alignment between stated confidence and direct-answer uncertainty, which improves after instruction tuning and LoRA fine-tuning. DG fine-tuning transfers to ECT despite using only binary answer/delegate labels. Mechanistically, confidence-report states increasingly contain direct-answer uncertainty information, confidence-report directions align with answer-certainty directions, and causal interventions affect confidence reports while largely preserving answer accuracy. These results support a limited form of latent metacognition: training routes internal uncertainty signals into self-evaluative reports and decisions.
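As an illustration of the ECT comparison described in the abstract, the following is a minimal sketch rather than the authors' implementation. It assumes that answer-option uncertainty is measured as the entropy of a softmax over the option logits from the direct-answer pass, and that alignment with stated confidence is scored by Spearman rank correlation; the abstract specifies neither choice. The function names and the toy numbers are hypothetical.

```python
import math

def option_entropy(logits):
    """Shannon entropy (in nats) of the softmax over answer-option logits.
    Assumption: entropy is the uncertainty measure; the abstract does not say."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    probs = [e / z for e in exps]
    return -sum(p * math.log(p) for p in probs if p > 0)

def average_ranks(values):
    """1-based ranks; tied values share the mean of their rank positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def spearman(xs, ys):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    rx, ry = average_ranks(xs), average_ranks(ys)
    n = len(xs)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    return cov / math.sqrt(vx * vy)

# Hypothetical toy data: option logits from a direct-answer pass (4 questions,
# 4 options each) and the confidence stated in a separate pass.
direct_logits = [
    [4.0, 0.0, 0.0, 0.0],  # sharply peaked: low uncertainty
    [2.0, 1.0, 0.0, 0.0],
    [1.0, 1.0, 0.5, 0.0],
    [0.0, 0.0, 0.0, 0.0],  # uniform: maximal uncertainty
]
stated_confidence = [0.9, 0.7, 0.5, 0.3]

entropies = [option_entropy(l) for l in direct_logits]
rho = spearman(entropies, stated_confidence)
```

Under these assumptions, a well-aligned model yields a strongly negative `rho`, since higher answer-option entropy should go with lower stated confidence; a value near zero would indicate chance-level alignment.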

Submission Number: 646
