Uncertainty Quantification for Language Models: A Suite of Black-Box, White-Box, LLM Judge, and Ensemble Scorers

Published: 12 Nov 2025, Last Modified: 12 Nov 2025
Accepted by TMLR
License: CC BY 4.0
Abstract: Hallucinations are a persistent problem with Large Language Models (LLMs). As these models are increasingly used in high-stakes domains such as healthcare and finance, effective hallucination detection becomes crucial. To this end, we outline a versatile framework for closed-book hallucination detection that practitioners can apply to real-world use cases. We adapt a variety of existing uncertainty quantification (UQ) techniques, including black-box UQ, white-box UQ, and LLM-as-a-Judge, transforming them as necessary into standardized response-level confidence scores ranging from 0 to 1. To enhance flexibility, we propose a tunable ensemble approach that incorporates any combination of the individual confidence scores, enabling practitioners to optimize the ensemble for a specific use case and improve performance. To streamline implementation, the full suite of scorers is offered in this paper's companion Python toolkit, \texttt{uqlm}. To evaluate the performance of the various scorers, we conduct an extensive set of experiments on several LLM question-answering benchmarks. We find that our tunable ensemble typically surpasses its individual components and outperforms existing hallucination detection methods. Our results demonstrate the benefits of customized hallucination detection strategies for improving the accuracy and reliability of LLMs.
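To make the ensemble idea concrete, the sketch below combines several response-level confidence scores (each in [0, 1]) into a single weighted score and applies a decision threshold. It is a minimal, library-agnostic illustration of the approach described in the abstract, not the \texttt{uqlm} implementation; the scorer names, weights, and threshold are hypothetical.

```python
def ensemble_confidence(component_scores: dict[str, float],
                        weights: dict[str, float]) -> float:
    """Combine response-level confidence scores (each in [0, 1]) into a single
    ensemble confidence via a normalized weighted average."""
    total = sum(weights.values())
    return sum(weights[name] * component_scores[name] for name in weights) / total


# Hypothetical component scores for one LLM response.
scores = {"black_box_nli": 0.82, "white_box_token_prob": 0.74, "llm_judge": 0.90}

# Tunable weights: in practice these would be optimized on a graded
# question-answering set for the target use case.
weights = {"black_box_nli": 0.5, "white_box_token_prob": 0.2, "llm_judge": 0.3}

confidence = ensemble_confidence(scores, weights)
flag_hallucination = confidence < 0.5  # example decision threshold
print(f"ensemble confidence = {confidence:.3f}, flagged = {flag_hallucination}")
```

Because the combination is a simple convex weighting of standardized scores, any subset of black-box, white-box, or judge-based scorers can be mixed, and the weights and threshold can be tuned per use case.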
Submission Length: Regular submission (no more than 12 pages of main content)
Changes Since Last Submission:

**Terminology & notation**
- Replaced “zero-resource” with “closed-book” throughout to reflect that no retrieval, external corpora, or internet access is used at scoring time.
- Clarified notation in the decision rule: context variables, including the input prompt and optional candidate responses, are no longer denoted by $\theta$.

**Scope**
- Added a concise note that we use UQ-style signals as confidence scorers for generation-time hallucination detection rather than full uncertainty modeling; we do not separate aleatoric from epistemic uncertainty.
- Made the evaluation scope explicit: results are in-domain. We do not evaluate OOD generalization or cross-dataset transfer of learned ensemble weights.

**Updated experiments with current LLMs**
- Re-ran experiments with currently available models: GPT-4o, GPT-4o-mini, Gemini-2.5-Flash, and Gemini-2.5-Flash-Lite.
- Outcomes remain consistent with the prior version (updated numbers are reflected in the tables and figures): the ensemble outperforms individual scorers; NLI remains the strongest black-box family in most settings; LLM-as-a-Judge performance tracks model accuracy; and additional sampled responses yield diminishing returns.

**Open-source repository**
- Provided the package name, `uqlm`, and link: <https://github.com/cvs-health/uqlm>.
- Added a maintenance statement: the repository is actively maintained; we welcome issues and PRs and will update integrations, add new scorers as research advances, and refresh examples as model/provider APIs evolve.
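For orientation, a minimal usage sketch of the toolkit's ensemble scorer is shown below. The import path, class and method names (`UQEnsemble`, `generate_and_score`, `to_df`), and the LangChain `ChatOpenAI` wrapper are assumptions based on our reading of the repository's README and may differ from the current API; consult <https://github.com/cvs-health/uqlm> for the authoritative interface.

```python
import asyncio

from langchain_openai import ChatOpenAI  # any LangChain chat model should work (assumption)
from uqlm import UQEnsemble              # assumed import path; see the repo README


async def main() -> None:
    # Sampling temperature > 0 so repeated generations can disagree,
    # which black-box consistency scorers rely on.
    llm = ChatOpenAI(model="gpt-4o-mini", temperature=1.0)

    prompts = ["When did the Roman Empire fall?"]

    # Assumed constructor and method names: the ensemble generates responses,
    # computes component confidence scores, and combines them into one score.
    uqe = UQEnsemble(llm=llm)
    results = await uqe.generate_and_score(prompts=prompts)

    print(results.to_df())  # assumed helper returning per-prompt scores as a DataFrame


if __name__ == "__main__":
    asyncio.run(main())
```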
Code: https://github.com/cvs-health/uqlm
Supplementary Material: zip
Assigned Action Editor: ~Tal_Schuster1
Submission Number: 5688