Quantifying the Consistency, Fidelity, and Reliability of LLM Verbalized Confidence

ACL ARR 2025 May Submission 4492 Authors

20 May 2025 (modified: 03 Jul 2025) · ACL ARR 2025 May Submission · CC BY 4.0
Abstract: Large language models (LLMs) can be prompted to express their confidence in answers to a given query, referred to as \emph{verbalized confidence}, to help users assess trustworthiness. However, verbalized confidence is often poorly calibrated with factual accuracy, raising questions about whether it can be trusted. To address this, we formulate a set of test metrics to evaluate verbalized confidence across a broad range of LLMs along three dimensions: \textbf{\emph{Consistency}}--how stable the confidence is across diverse prompts that elicit confidence in different formats, e.g., numerical scales; \textbf{\emph{Fidelity}}--whether the model is faithful to its own answers, e.g., more confident in them than in counterfactual answers; \textbf{\emph{Reliability}}--how well the stated confidence aligns with answer correctness. Our findings reveal that GPT-4o, which provides the most consistent and reliable confidence, performs suboptimally on fidelity compared to smaller models. Furthermore, all LLMs are generally most confident in their original answers, even compared to higher-quality gold responses. Reliability is highly sensitive to the prompt format and the chosen calibration metric. We therefore conclude that each evaluation dimension captures a distinct aspect of model trustworthiness.
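To make the Reliability dimension concrete, the sketch below computes Expected Calibration Error (ECE), one common calibration metric to which the abstract's sensitivity claim could apply; the paper may use different or additional metrics, and the function name, binning scheme, and example values here are illustrative assumptions rather than the authors' implementation.

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE sketch: bin answers by stated confidence, then average the
    absolute gap between per-bin accuracy and per-bin mean confidence,
    weighted by bin size. Not the paper's exact metric."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    n = len(confidences)
    ece = 0.0
    for i, (lo, hi) in enumerate(zip(edges[:-1], edges[1:])):
        # bins are (lo, hi]; the first bin also includes confidence == 0
        mask = (confidences > lo) & (confidences <= hi)
        if i == 0:
            mask |= confidences == 0.0
        if not mask.any():
            continue
        gap = abs(correct[mask].mean() - confidences[mask].mean())
        ece += (mask.sum() / n) * gap
    return ece

# Hypothetical example: verbalized confidences in [0, 1] and binary correctness labels
print(expected_calibration_error([0.9, 0.8, 0.6, 0.95], [1, 0, 1, 1]))
```

A lower ECE indicates that stated confidence tracks accuracy more closely; because the result depends on the binning scheme and on how verbalized confidence is mapped to [0, 1], different metric choices can rank the same model differently, consistent with the sensitivity noted in the abstract.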
Paper Type: Long
Research Area: Interpretability and Analysis of Models for NLP
Research Area Keywords: calibration/uncertainty
Contribution Types: Model analysis & interpretability, Data analysis
Languages Studied: English
Submission Number: 4492