Keywords: Uncertainty Quantification, Epistemic Uncertainty, Aleatoric Uncertainty, Calibration, Classification, Deep Ensembles, Predictive Distribution, Predictive Sets
TL;DR: JUCAL is a lightweight method that balances the ratio of aleatoric and epistemic uncertainty in ensembles, improving predictive reliability over existing calibration methods.
Abstract: We study post-hoc uncertainty calibration for trained ensembles of classifiers.
Specifically, we consider both aleatoric uncertainty (i.e., label noise) and epistemic uncertainty (i.e., model uncertainty).
Among the most popular and widely used calibration methods in classification are temperature scaling (i.e., *pool-then-calibrate*) and conformal methods.
However, the main shortcoming of these calibration methods is that they do not balance the proportion of aleatoric and epistemic uncertainty.
Failing to balance epistemic and aleatoric uncertainty can severely misrepresent predictive uncertainty, yielding overconfident predictions in some input regions while remaining underconfident in others.
To address this shortcoming, we present a simple but powerful calibration algorithm, *Joint Uncertainty Calibration (JUCAL)*, that jointly calibrates aleatoric and epistemic uncertainty.
JUCAL jointly fits two constants that weight and scale the epistemic and aleatoric uncertainties by minimizing the *negative log-likelihood (NLL)* on a validation/calibration dataset.
JUCAL can be applied to any trained ensemble of classifiers (e.g., transformers, CNNs, or tree-based methods) with minimal computational overhead and without requiring access to the models' internal parameters.
We experimentally evaluate JUCAL on various text classification tasks, for ensembles of varying sizes and with different ensembling strategies.
Our experiments show that JUCAL significantly outperforms state-of-the-art calibration methods across all considered classification tasks, reducing NLL and predictive set size by up to 15% and 20%, respectively.
Interestingly, applying JUCAL to an ensemble of size 5 can outperform temperature-scaled ensembles of size up to 50 in terms of NLL and predictive set size, at up to 10 times lower inference cost.
Thus, we propose JUCAL as a new go-to method for calibrating ensembles in classification.
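The abstract describes JUCAL only at a high level (two constants fit by minimizing NLL on a calibration split), so the following is a minimal sketch under one plausible, assumed parameterization: a temperature `tau` scaling the ensemble-mean logits (aleatoric part) and a weight `lam` scaling each member's deviation from that mean (epistemic part). All names (`calibrated_log_probs`, `fit_joint_calibration`, `tau`, `lam`) are illustrative, not the paper's.

```python
# Hypothetical sketch of joint two-parameter ensemble calibration.
# Assumed parameterization (NOT necessarily the paper's JUCAL):
#   tau  -- scales the ensemble-mean logits        (aleatoric / label noise)
#   lam  -- scales each member's deviation from it (epistemic / disagreement)
# Both scalars are fit by minimizing NLL on a held-out calibration set.
import math
import torch
import torch.nn.functional as F

def calibrated_log_probs(member_logits, log_tau, log_lam):
    """member_logits: (M, N, C) raw logits from M ensemble members."""
    tau, lam = log_tau.exp(), log_lam.exp()
    mean = member_logits.mean(dim=0, keepdim=True)        # (1, N, C)
    scaled = mean / tau + lam * (member_logits - mean)     # re-weight both parts
    # Ensemble predictive distribution: average the member probabilities.
    return torch.logsumexp(F.log_softmax(scaled, dim=-1), dim=0) - math.log(
        member_logits.shape[0]
    )

def fit_joint_calibration(member_logits, labels, max_iter=200, lr=0.1):
    """Fit (tau, lam) jointly by minimizing NLL on the calibration data."""
    log_tau = torch.zeros((), requires_grad=True)   # tau = 1 at initialization
    log_lam = torch.zeros((), requires_grad=True)   # lam = 1 at initialization
    opt = torch.optim.LBFGS([log_tau, log_lam], lr=lr, max_iter=max_iter)

    def closure():
        opt.zero_grad()
        log_probs = calibrated_log_probs(member_logits, log_tau, log_lam)
        loss = F.nll_loss(log_probs, labels)        # NLL on the calibration set
        loss.backward()
        return loss

    opt.step(closure)
    return log_tau.exp().item(), log_lam.exp().item()
```

With `lam` fixed to 1, this reduces to ordinary temperature scaling of the pooled ensemble; the extra scalar is what allows the epistemic (between-member) spread to be rescaled relative to the aleatoric part.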
Primary Area: probabilistic methods (Bayesian methods, variational inference, sampling, UQ, etc.)
Submission Number: 21308