Beyond Binary Evaluation: Measuring Language Model Hallucinations Through Distributional Correctness

ICLR 2026 Conference Submission 20881 Authors

19 Sept 2025 (modified: 08 Oct 2025) · ICLR 2026 Conference Submission · CC BY 4.0
Keywords: LLMs, evaluation, metrics, hallucination
TL;DR: Introduces a new LLM evaluation metric for implicit hallucination correction of common benchmarks
Abstract: Common evaluation paradigms for language models score single responses through accuracy metrics or proper scoring rules, failing to capture the full richness of a model's belief state. Recent work shows that language models hallucinate in part because they are optimised to be good test-takers under binary scoring schemes that reward any answer over abstention. While this insight naturally leads to penalty-based approaches, such approaches ignore crucial distinctions in how models distribute uncertainty, for example between hedging toward incorrect answers and hedging toward ``I don't know'' responses. We introduce a novel evaluation metric that accounts for a model's entire probability distribution over answer choices. The metric naturally distinguishes harmful overconfidence in wrong answers from uncertainty expressed through abstention, providing scores in an interpretable default range. Through theoretical analysis and illustrative examples, we demonstrate that our metric offers a more nuanced and better-aligned evaluation paradigm that incentivises models to express genuine uncertainty rather than guess. We then adapt 12 existing evaluation benchmarks to our metric's variants and measure the performance of six language models, showing that for half of the tested benchmarks scores are *negative across all tested models*, indicating significant tendencies towards hallucination.
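To make the idea concrete, here is a minimal sketch of one way such a distributional score could behave, assuming (as the abstract suggests but does not specify) that probability mass on the correct answer earns credit, mass on wrong answers incurs a penalty, and mass on an abstention option is neutral. The function name `distributional_correctness` and the exact weighting are illustrative assumptions, not the paper's actual definition.

```python
def distributional_correctness(probs, correct, abstain="I don't know"):
    """Hypothetical sketch of a distributional correctness score.

    probs: dict mapping each answer option to the model's probability for it.
    correct: the correct answer option.
    abstain: the abstention option, which is neither rewarded nor penalised.

    Under these assumptions the score lies in [-1, 1] and becomes negative
    whenever the model places more mass on wrong answers than on the correct
    one, matching the abstract's reading of negative benchmark scores as a
    tendency towards hallucination.
    """
    score = 0.0
    for option, p in probs.items():
        if option == correct:
            score += p          # credit for mass on the correct answer
        elif option == abstain:
            score += 0.0        # abstention is neutral
        else:
            score -= p          # penalty for mass on wrong answers
    return score


# A model that hedges toward a wrong answer scores clearly negative ...
print(distributional_correctness(
    {"A": 0.10, "B": 0.80, "C": 0.05, "I don't know": 0.05}, correct="A"))  # -0.75

# ... while hedging toward abstention largely avoids the penalty.
print(distributional_correctness(
    {"A": 0.10, "B": 0.10, "C": 0.05, "I don't know": 0.75}, correct="A"))  # -0.05
```

This toy scoring rule is only meant to illustrate how a metric over the full answer distribution can separate overconfident wrong answers from expressed uncertainty; the paper's metric and its variants may differ in form.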
Primary Area: datasets and benchmarks
Submission Number: 20881