Black-Box Uncertainty Quantification for Large Language Models via Ensemble-of-Ensembles

Published: 06 Nov 2025, Last Modified: 06 Nov 2025, AIR-FM Poster, CC BY 4.0
Confirmation: I have read and agree with the workshop's policy on behalf of myself and my co-authors.
Keywords: Uncertainty Quantification, Black-box Methods, Uncertainty Decomposition, Perturbation-based Methods
TL;DR: We propose a two-level ensemble framework for black-box uncertainty quantification in large language models that decomposes total uncertainty into aleatoric and epistemic components and achieves reliability comparable to white-box and black-box baselines.
Abstract: Uncertainty quantification (UQ) is essential for building reliable and trustworthy large language models (LLMs). However, conventional Bayesian or ensemble-based UQ methods are computationally intractable at the scale of modern LLMs and often require white-box access to model parameters or logits. This paper introduces a two-level ensemble framework for black-box uncertainty estimation that operates entirely at inference time, without retraining or architectural modification. The method is theoretically grounded in the law of total variance, decomposing total predictive uncertainty into aleatoric and epistemic components. The inner ensemble captures stochasticity and ambiguity through repeated stochastic decoding, while the outer ensemble approximates parameter uncertainty via semantically perturbed prompts that serve as proxy samples from the implicit posterior. By measuring variance in a continuous embedding space, our framework yields interpretable and scalable uncertainty estimates across diverse LLMs. Experiments on the TriviaQA and TruthfulQA benchmarks demonstrate that our black-box estimator achieves AUROC performance comparable to or surpassing state-of-the-art white-box baselines, while offering meaningful uncertainty decomposition that distinguishes linguistic ambiguity from knowledge uncertainty.
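The law-of-total-variance decomposition described in the abstract can be sketched as follows. This is an illustrative reconstruction, not the paper's implementation: the function name, array shapes, and the choice of summing per-dimension variances into a scalar score are all assumptions. The outer axis stands in for semantically perturbed prompts (the outer ensemble) and the inner axis for repeated stochastic decodings (the inner ensemble), with each response mapped to a continuous embedding.

```python
import numpy as np

def decompose_uncertainty(embeddings: np.ndarray):
    """Two-level uncertainty decomposition via the law of total variance.

    embeddings: array of shape (M, K, D) -- M perturbed prompts (outer
    ensemble), K stochastic decodings per prompt (inner ensemble), and
    D-dimensional response embeddings (shapes are illustrative).
    Returns (total, aleatoric, epistemic) as scalar scores obtained by
    summing the per-dimension variances.
    """
    # Aleatoric term: expected within-prompt variance, i.e. the spread
    # of the inner ensemble averaged over prompts (E[Var(Y | X)]).
    within = embeddings.var(axis=1)        # shape (M, D)
    aleatoric = within.mean(axis=0).sum()
    # Epistemic term: variance of the per-prompt mean embeddings across
    # the outer ensemble (Var(E[Y | X])).
    means = embeddings.mean(axis=1)        # shape (M, D)
    epistemic = means.var(axis=0).sum()
    # Law of total variance: total = aleatoric + epistemic.
    return aleatoric + epistemic, aleatoric, epistemic

# Toy check with synthetic embeddings standing in for encoded LLM answers.
rng = np.random.default_rng(0)
emb = rng.normal(size=(5, 8, 16))          # 5 prompts x 8 samples x 16-d
total, alea, epis = decompose_uncertainty(emb)
```

With equal-sized inner ensembles and population variances, the two components sum exactly to the pooled variance over all M x K samples, which is what makes the decomposition interpretable: a high aleatoric share points to linguistic ambiguity, a high epistemic share to knowledge uncertainty.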
Supplementary Material: zip
Submission Track: Workshop Paper Track
Submission Number: 38