Two Samples Are Enough: Verbal Confidence Meets Self-Consistency in Reasoning LLMs

ICLR 2026 Conference Submission 21120 Authors

19 Sept 2025 (modified: 08 Oct 2025) · ICLR 2026 Conference Submission · CC BY 4.0
Keywords: confidence estimation, confidence elicitation, calibration, verbalized confidence, self-consistency, black-box confidence estimation, sampling uncertainty, uncertainty estimation, empirical evaluation, reasoning models, large language models
TL;DR: Combining verbalized confidence and self-consistency in reasoning LLMs shows that just two samples are enough to achieve strong and reliable confidence estimation. Additionally, verbalized confidence on non-mathematical tasks can be improved with prompting.
Abstract: Large reasoning models (LRMs) achieve strong problem-solving ability but also produce confident errors, making reliable uncertainty estimation essential. Prior work on standard large language models proposed two approaches: advanced \emph{verbalized confidence} (VC), where the model self-checks its chain-of-thought or directly reports its own certainty, and \emph{self-consistency} (SC), where agreement across multiple stochastic answer samples to the same question indicates reliability. How these methods behave in LRMs, with their long, rich, and internally branching reasoning traces, remains unclear. We present the first systematic evaluation of six VC methods, SC, and their hybrid (VCSC) across nine scientific benchmarks and three LRMs. We find that advanced VC instructions bring little benefit: they sometimes reduce accuracy on mathematical tasks and improve AUROC by only about three percentage points on non-mathematical tasks. By contrast, VC-based parallel sampling and hybridization deliver dramatic gains: with just two repeats, VCSC improves AUROC by over 10 points on average. With larger budgets, parallel VC alone can approach perfect discrimination, as in the distilled DeepSeek model on AIME, where AUROC reaches 1.0. These results establish VCSC as a simple, overlooked, and highly effective recipe for uncertainty estimation in LRMs, and deepen our understanding of how these models expose and exploit their own uncertainty.
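To make the hybrid concrete, below is a minimal sketch of one plausible VCSC aggregation: each stochastic sample returns an answer plus a verbalized confidence, the self-consistency term is the majority-vote agreement, and the two signals are combined multiplicatively. The abstract does not specify the paper's exact aggregation rule, so the function name, the multiplicative combination, and the example values are illustrative assumptions, not the authors' method.

```python
from collections import Counter

def vcsc_confidence(samples):
    """Hybrid verbalized-confidence / self-consistency (VCSC) score.

    `samples` is a list of (answer, verbalized_confidence) pairs obtained
    from independent stochastic generations for the same question.
    Returns the majority answer and a combined confidence score.
    NOTE: the multiplicative combination below is an assumed, illustrative
    aggregation rule; the paper's exact formula is not given in the abstract.
    """
    answers = [a for a, _ in samples]
    majority_answer, votes = Counter(answers).most_common(1)[0]
    agreement = votes / len(samples)                 # self-consistency term
    vc_scores = [c for a, c in samples if a == majority_answer]
    mean_vc = sum(vc_scores) / len(vc_scores)        # verbalized-confidence term
    return majority_answer, agreement * mean_vc      # simple multiplicative hybrid

# Example with the paper's two-sample budget (hypothetical values):
two_samples = [("42", 0.9), ("42", 0.8)]
print(vcsc_confidence(two_samples))  # ('42', 0.85): full agreement, mean VC 0.85
```

With only two samples, the agreement term is either 0.5 or 1.0, so the verbalized confidences do most of the work of ranking correct versus incorrect answers; this is consistent with the abstract's claim that two repeats already yield large AUROC gains.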
Primary Area: foundation or frontier models, including LLMs
Submission Number: 21120