Uncertainty as Feature Gaps: Epistemic Uncertainty Quantification of LLMs in Contextual Question-Answering
Keywords: Uncertainty Quantification, LLMs, Epistemic Uncertainty
Abstract: Uncertainty Quantification (UQ) has become an essential tool for detecting hallucinations and unreliable outputs of large language models (LLMs), particularly in settings where external verification is infeasible, such as contextual question answering (QA). To quantify epistemic uncertainty, i.e., the model's confusion when attempting to answer a question reliably, we introduce a generic token-level uncertainty measure defined as the cross-entropy between the distribution of the actual model and that of an ideal, maximally reliable hypothetical model. By decomposing this measure, we isolate the epistemic component and show that it can be bounded by the features the actual model lacks relative to the ideal one. We hypothesize that three features approximate this gap in contextual QA: \emph{honesty} (avoiding intentional falsehoods), \emph{contextual reliance} (using the provided context rather than parametric knowledge), and \emph{contextual resolvability} (extracting the relevant information from the context). Using a top-down interpretability approach, we extract these features from only a small number of labeled samples and ensemble them into a robust uncertainty score. Extensive experiments on multiple QA benchmarks demonstrate that our method substantially outperforms state-of-the-art unsupervised (sampling-free and sampling-based) and supervised UQ approaches, achieving improvements of up to 13 PRR points while requiring no sampling and incurring negligible additional inference cost. Finally, we demonstrate the effectiveness and robustness of our method through comprehensive ablation studies.
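To make the measure concrete, here is a minimal sketch of the decomposition the abstract describes, under assumed notation that the abstract itself does not fix: $p^{*}(\cdot \mid x)$ denotes the token distribution of the ideal, maximally reliable model and $p_{\theta}(\cdot \mid x)$ that of the actual model, given input $x$. The split below is simply the standard cross-entropy identity, with the KL term read as the epistemic gap; the paper's exact bound in terms of missing features may differ.

$$
\underbrace{H\bigl(p^{*}(\cdot \mid x),\, p_{\theta}(\cdot \mid x)\bigr)}_{\text{total token-level uncertainty}}
\;=\;
\underbrace{H\bigl(p^{*}(\cdot \mid x)\bigr)}_{\text{irreducible (aleatoric)}}
\;+\;
\underbrace{D_{\mathrm{KL}}\bigl(p^{*}(\cdot \mid x)\,\big\|\,p_{\theta}(\cdot \mid x)\bigr)}_{\text{epistemic gap}}
$$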
Submission Number: 42