Keywords: LLMs, Uncertainty Quantification
TL;DR: We show theoretically and empirically that UQ methods for LLMs fail under ambiguity. We explore alternatives and release two datasets with ground-truth answer probabilities.
Abstract: Accurate uncertainty quantification (UQ) in Large Language Models (LLMs) is critical for trustworthy deployment. While real-world language is inherently ambiguous, existing UQ methods are typically benchmarked against tasks with no ambiguity. In this work, we demonstrate that while current uncertainty estimators perform well under the restrictive assumption of no ambiguity, they degrade to close-to-random performance on ambiguous data. To enable this analysis, we introduce MAQA* and AmbigQA*, the first ambiguous question-answering (QA) datasets equipped with ground-truth answer distributions estimated from factual co-occurrence. We find this performance deterioration to be consistent across different modeling paradigms: using the predictive distribution itself, internal representations throughout the model, and an ensemble of models. We show that this phenomenon can be explained theoretically, revealing that predictive-distribution and ensemble-based estimators are fundamentally limited under ambiguity. Overall, our study reveals a key shortcoming of current UQ methods for LLMs and motivates new approaches that explicitly model uncertainty during training.
Primary Area: foundation or frontier models, including LLMs
Submission Number: 20692