Abstract: Confidence estimation techniques are often used to gauge how much trust to place in the answers produced by a Large Language Model (LLM). One such technique is $\textit{verbalized confidence}$: a prompting setup in which the model reports a confidence score alongside its actual answer. The mechanisms behind these self-reported confidence values, however, remain poorly understood. This paper presents a comprehensive analysis of verbalized confidence across multiple datasets spanning factual questions, multiple-choice QA, and causal reasoning, using four different LLMs.
Our investigation reveals that verbalized confidence scores are $\textit{highly quantized}$, clustering around a few specific values (e.g., 0, 90, 100) with minimal differentiation between correct and incorrect answers. Through causal mediation analysis and targeted input perturbations, we demonstrate that confidence score generation is driven primarily by structural prompt elements, such as the word $``confidence''$ and the specified scale range, rather than by the question's actual content.
These findings provide valuable insights into the behavior of verbalized confidence and underscore the importance of developing more reliable self-evaluation mechanisms for LLMs.
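The quantization effect described above can be illustrated with a minimal sketch. The helper below measures what fraction of a batch of verbalized confidence scores lands exactly on a small set of anchor values; the score list and the anchor set {0, 90, 100} are illustrative assumptions, not data or code from the paper.

```python
def quantization_mass(scores, anchors=frozenset({0, 90, 100})):
    """Fraction of confidence scores that fall exactly on the anchor values.

    A value near 1.0 means the scores are highly quantized: the model
    almost always emits one of a few round numbers.
    """
    hits = sum(1 for s in scores if s in anchors)
    return hits / len(scores)

# Hypothetical verbalized-confidence outputs from repeated prompting
# (illustrative only, not the paper's data):
scores = [100, 90, 90, 100, 0, 95, 90, 100, 100, 90]
print(quantization_mass(scores))  # -> 0.9
```

In a real evaluation, `scores` would come from parsing the model's responses to a prompt such as "Answer the question and state your confidence from 0 to 100"; comparing this mass between correct and incorrect answers would show how little the two distributions differ.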
Submission Type: Regular submission (no more than 12 pages of main content)
Assigned Action Editor: ~Jasper_Snoek1
Submission Number: 8655