Verbalized Confidence Triggers Self-Verification: Emergent Behavior Without Explicit Reasoning Supervision
Keywords: calibration, reasoning, LLMs, self-verification
TL;DR: CSFT trains LLMs to verbalize confidence, yielding better calibration, accuracy, self-verification, and cross-task generalization.
Abstract: Large language models (LLMs) are increasingly used as reasoning partners in domains such as mathematics, coding, and decision support, where reliable expression of confidence is essential for human–AI interaction. However, current LLMs often generate incorrect answers with high confidence, posing significant risks in downstream decision-making. Prior approaches for inducing verbalized confidence, including reinforcement learning and external probing, have shown limited generalization across complex reasoning and unseen tasks. We introduce Confidence-Supervised Fine-Tuning (CSFT), a simple method that trains models to output both an answer and an explicit confidence statement. CSFT substantially reduces calibration errors while also improving accuracy, induces emergent self-verification behaviors such as self-checking under low confidence, and reshapes token distributions into a locally smooth structure around correct answers. Furthermore, although trained only on reasoning tasks, CSFT generalizes to non-reasoning benchmarks such as MMLU. These results establish verbalized confidence as a scalable mechanism for improving calibration, reasoning, and generalization in LLMs.
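To make the supervision format concrete, below is a minimal sketch of how CSFT-style training targets pairing an answer with a verbalized confidence statement might be constructed. The field names, confidence phrasing, and binning rule are illustrative assumptions, not details taken from the paper.

```python
# Hypothetical sketch of building CSFT-style supervised fine-tuning targets.
# Assumptions (not from the paper): each record has a question, a reference
# answer, and an estimated correctness probability used to choose the
# verbalized confidence; the exact wording and bins are illustrative.

def confidence_phrase(p_correct: float) -> str:
    """Map an estimated correctness probability to a verbalized confidence."""
    if p_correct >= 0.9:
        return "Confidence: high (~90%)"
    if p_correct >= 0.6:
        return "Confidence: medium (~70%)"
    return "Confidence: low (~40%)"

def build_csft_example(question: str, answer: str, p_correct: float) -> dict:
    """Create one SFT example whose target contains the answer plus confidence."""
    target = f"Answer: {answer}\n{confidence_phrase(p_correct)}"
    return {"prompt": question, "completion": target}

# Example usage
example = build_csft_example(
    question="What is 17 * 24?",
    answer="408",
    p_correct=0.95,
)
print(example["completion"])
# Answer: 408
# Confidence: high (~90%)
```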
Supplementary Material: zip
Primary Area: foundation or frontier models, including LLMs
Submission Number: 23342