Abstract: Large language models (LLMs) can be prompted to express their confidence in answers to a given query, referred to as \emph{verbalized confidence}, to help users assess trustworthiness.
However, verbalized confidence is often poorly calibrated with factual accuracy, raising questions about whether it can be trusted. To address this, we formulate a set of test metrics to evaluate verbalized confidence across a broad range of LLMs along three dimensions: \textbf{\emph{Consistency}}--how stable the confidence is across diverse prompts that elicit confidence in different formats, e.g., numerical scales; \textbf{\emph{Fidelity}}--whether the model is faithful to its own answers, e.g., more confident about them than about counterfactual answers; and \textbf{\emph{Reliability}}--how well the stated confidence aligns with answer correctness.
Our findings reveal that GPT-4o, which provides the most consistent and reliable confidence, underperforms smaller models on fidelity. Furthermore, all LLMs are generally most confident in their original answers, even when compared with higher-quality gold responses. Reliability is highly sensitive to both the prompt format and the chosen calibration metric. We therefore conclude that each evaluation dimension captures a distinct aspect of model trustworthiness.
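As a concrete illustration of the reliability dimension, the sketch below computes expected calibration error (ECE), one common calibration metric, from verbalized confidences and correctness labels. This is a minimal NumPy example with hypothetical inputs and function names, not the paper's actual evaluation code or its chosen metric.

import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    # ECE: weighted average gap between mean confidence and accuracy per bin.
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    # Assign each prediction to one of n_bins equal-width confidence bins.
    bin_ids = np.minimum((confidences * n_bins).astype(int), n_bins - 1)
    ece = 0.0
    for b in range(n_bins):
        mask = bin_ids == b
        if mask.any():
            gap = abs(confidences[mask].mean() - correct[mask].mean())
            ece += mask.mean() * gap
    return ece

# Hypothetical example: verbalized confidences rescaled to [0, 1] and answer correctness.
conf = [0.9, 0.8, 0.95, 0.6, 0.99]
acc = [1, 0, 1, 1, 0]
print(expected_calibration_error(conf, acc))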
Paper Type: Long
Research Area: Interpretability and Analysis of Models for NLP
Research Area Keywords: calibration/uncertainty
Contribution Types: Model analysis & interpretability, Data analysis
Languages Studied: English
Submission Number: 4492