Benchmarking Uncertainty Estimation in Large Language Model Replies for Natural Science Question Answering

ICLR 2026 Conference Submission 19314 Authors

19 Sept 2025 (modified: 08 Oct 2025) · CC BY 4.0
Keywords: LLM, Uncertainty, Calibration, Natural Science, benchmark, physics, instruction-tuned, reasoning
Abstract: Large language models (LLMs) are commonly used in question answering (QA) settings, including natural science and related research domains. Reliable uncertainty quantification (UQ) is critical for the trustworthy uptake of generated answers, yet existing approaches remain insufficiently validated in scientific QA. We introduce the first large-scale benchmark for evaluating UQ metrics in this setting, providing an extensible open-source framework to assess calibration across diverse models and datasets. Our study spans eleven LLMs in base, instruction-tuned, and reasoning variants and covers eight scientific QA datasets, including both multiple-choice and arithmetic question answering tasks. At the token level, we find that instruction tuning induces strong probability mass polarization, reducing the reliability of token-level confidences as estimates of uncertainty. At the sequence level, we show that verbalized uncertainty estimates are systematically biased and poorly correlated with correctness, while answer frequency (consistency across samples) yields the most reliable calibration, albeit at high computational cost. These findings expose critical limitations of current UQ methods for LLMs and highlight concrete opportunities for developing scalable, well-calibrated confidence measures for scientific QA.
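To make the abstract's terms concrete, the sketch below illustrates two of the quantities it discusses: an answer-frequency (consistency-across-samples) confidence score and expected calibration error. This is a minimal illustration, not the authors' framework; all function names, signatures, and the toy data are assumptions introduced here.

```python
# Minimal sketch (illustrative only, not the submission's code): answer-frequency
# confidence for one question, plus expected calibration error (ECE) over many.
from collections import Counter


def answer_frequency_confidence(sampled_answers):
    """Confidence = fraction of sampled answers agreeing with the majority answer."""
    counts = Counter(sampled_answers)
    answer, count = counts.most_common(1)[0]
    return answer, count / len(sampled_answers)


def expected_calibration_error(confidences, correctness, n_bins=10):
    """Standard ECE: bin predictions by confidence, compare mean confidence to accuracy."""
    bins = [[] for _ in range(n_bins)]
    for conf, correct in zip(confidences, correctness):
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, correct))
    total = len(confidences)
    ece = 0.0
    for bucket in bins:
        if not bucket:
            continue
        avg_conf = sum(c for c, _ in bucket) / len(bucket)
        accuracy = sum(ok for _, ok in bucket) / len(bucket)
        ece += (len(bucket) / total) * abs(avg_conf - accuracy)
    return ece


# Toy usage: five sampled answers to one arithmetic question, then ECE on a tiny set.
answer, conf = answer_frequency_confidence(["42", "42", "41", "42", "42"])
print(answer, conf)  # -> "42", 0.8
print(expected_calibration_error([0.8, 0.6, 0.9], [True, False, True]))
```

The sketch also makes the abstract's cost caveat visible: answer-frequency confidence requires several full generations per question, whereas token-level or verbalized confidences need only one.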
Primary Area: foundation or frontier models, including LLMs
Submission Number: 19314