Mind the Confidence Gap: Overconfidence, Calibration, and Distractor Effects in Large Language Models
Abstract: Large Language Models (LLMs) show remarkable proficiency in natural language tasks, yet their frequent overconfidence—misalignment between predicted confidence and true correctness—poses significant risks in critical decision-making applications. We present a comprehensive analysis of calibration across nine LLMs and three factual Question-Answering (QA) datasets, systematically comparing standard free-generation settings against structured distractor-augmented prompts. Our evaluation reveals that explicitly incorporating distractors can substantially mitigate miscalibration, achieving relative accuracy improvements of up to 460% and Expected Calibration Error (ECE) reductions of up to 90%. Beyond these general trends, we uncover nuanced findings: large RLHF-tuned models display inherent calibration strengths but can paradoxically suffer increased miscalibration on easier queries, whereas smaller models benefit disproportionately from distractor prompts yet remain significantly miscalibrated. Through detailed analyses across question types, we identify persistent calibration failures, particularly in person-based queries. We conclude with concrete recommendations—targeted fine-tuning, structured prompting, and strategic model choice—to ensure reliable, trustworthy LLM deployments.
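For reference, the calibration metric cited above can be computed as follows. This is a minimal sketch of the standard binned Expected Calibration Error, not the paper's implementation; the function name, the use of 10 equal-width bins, and the toy inputs are assumptions for illustration.

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """Binned ECE: weighted average gap between mean confidence and
    empirical accuracy within equal-width confidence bins."""
    confidences = np.asarray(confidences, dtype=float)  # elicited confidence in [0, 1]
    correct = np.asarray(correct, dtype=float)          # 1.0 if the answer was judged correct, else 0.0
    bin_edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for i, (lo, hi) in enumerate(zip(bin_edges[:-1], bin_edges[1:])):
        if i == n_bins - 1:
            mask = (confidences >= lo) & (confidences <= hi)  # include 1.0 in the last bin
        else:
            mask = (confidences >= lo) & (confidences < hi)
        if mask.any():
            gap = abs(confidences[mask].mean() - correct[mask].mean())
            ece += mask.mean() * gap  # weight by the fraction of samples in the bin
    return ece

# Toy example: an overconfident model (high confidence, mixed correctness).
print(expected_calibration_error([0.95, 0.9, 0.85, 0.99], [1, 0, 0, 1]))
```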
Submission Type: Regular submission (no more than 12 pages of main content)
Changes Since Last Submission: In response to the Action Editor’s recommendations, we made the following targeted revisions in the camera-ready version of the paper:
1. Clarified methodological justification for elicited confidence (Section 3.2).
We added a dedicated paragraph titled “Why Elicited Confidence?” that explains our rationale for relying on elicited confidence scores rather than logit-based measures. This revision incorporates supporting literature on the reliability of verbalized confidence in RLHF-tuned systems and discusses practical constraints arising from the lack of uniform logit access across APIs (a minimal sketch of the elicitation format is shown after this list).
2. Added explicit discussion of generator/evaluator bias (Limitations Section).
To address concerns about potential systemic bias arising from using GPT-4o-mini for both distractor generation and answer adjudication, we have created a Limitations section. The revised text now explicitly frames our findings as conditional on this fixed pipeline and documents the recommended “alternating LLM” validation strategy as important future work to assess generalizability.
These changes directly address the methodological and validity concerns raised during review and strengthen the transparency and reproducibility of our evaluation framework.
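To make the elicited-confidence setup concrete, the sketch below shows one way a verbalized confidence score can be requested and parsed. The prompt wording, the 0–100 scale, and the parsing helper are hypothetical illustrations and may differ from the paper's actual protocol; no API call is made here.

```python
import re

# Hypothetical prompt template asking the model to verbalize its confidence.
PROMPT_TEMPLATE = (
    "Question: {question}\n"
    "Answer the question, then state how confident you are.\n"
    "Respond exactly in the form:\n"
    "Answer: <your answer>\n"
    "Confidence: <a number between 0 and 100>"
)

def parse_response(text):
    """Extract the answer string and a confidence in [0, 1] from a model reply."""
    answer_match = re.search(r"Answer:\s*(.+)", text)
    conf_match = re.search(r"Confidence:\s*([0-9]+(?:\.[0-9]+)?)", text)
    answer = answer_match.group(1).strip() if answer_match else None
    confidence = float(conf_match.group(1)) / 100.0 if conf_match else None
    return answer, confidence

# Example with a mocked model reply.
reply = "Answer: Paris\nConfidence: 92"
print(parse_response(reply))  # ('Paris', 0.92)
```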
Code: https://github.com/prateekchhikara/llms-calibration
Assigned Action Editor: ~Hanie_Sedghi1
Submission Number: 5784