Automatic Calibration Diagnosis: Interpreting Probability Integral Transform (PIT) Histograms

22 Sept 2023 (modified: 11 Feb 2024) · Submitted to ICLR 2024
Supplementary Material: zip
Primary Area: probabilistic methods (Bayesian methods, variational inference, sampling, UQ, etc.)
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Keywords: predictive uncertainty, calibration, probability integral transform (PIT) histogram, probabilistic machine learning, regression
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2024/AuthorGuide.
TL;DR: We present a methodological concept for automatic calibration diagnosis of predictive models by interpreting the probability integral transform (PIT) histograms.
Abstract: Uncertainty quantification in predictive models is essential for safe decision-making and risk assessment. Predictive uncertainty is often represented by a full predictive distribution, its most general form. The goal is to maximise the sharpness of the predictive distribution subject to its calibration. This work addresses the proper calibration of predictive distributions in regression tasks. We particularly focus on machine learning models, which are increasingly prevalent in real-world applications. We employ the probability integral transform (PIT) histogram to evaluate calibration quality. It can be used to diagnose calibration problems, e.g. under- or over-estimation, under- or over-dispersion, or an incorrect number of modes. However, PIT histograms are often difficult to interpret because multiple calibration problems may occur simultaneously. To tackle this issue, we present a methodological concept for the automatic interpretation of PIT histograms. It is based on a mixture density network interpreter trained on a synthetic data set of PIT histograms. Given a predictive model, a data set, and the corresponding PIT histogram, the interpreter can identify a probable observation-generating distribution. This allows us to diagnose a potential calibration problem by comparing the predictive distribution with the probable observation-generating distribution. To showcase the power of the proposed concept for the automatic interpretation of PIT histograms, we apply it to regression tasks on standard data sets, achieving notable improvements in the calibration of machine learning models.
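The PIT construction that the abstract builds on is simple to state: for each observation y with predictive CDF F, the PIT value is u = F(y); if the predictive distribution is calibrated, the u values are uniform on [0, 1] and their histogram is flat, while miscalibration shows up as characteristic shapes (e.g. a hump for over-dispersion). The sketch below is an illustrative, stdlib-only example (not the authors' code); the Gaussian predictive distributions and bin count are assumptions for the demonstration.

```python
import math
import random

random.seed(0)

def norm_cdf(x, mu=0.0, sigma=1.0):
    # Gaussian CDF via the error function
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

# Observations drawn from the true generating process N(0, 1)
ys = [random.gauss(0.0, 1.0) for _ in range(20_000)]

# PIT values under a calibrated predictive distribution N(0, 1)
pit_good = [norm_cdf(y, 0.0, 1.0) for y in ys]

# PIT values under an over-dispersed predictive distribution N(0, 2)
pit_over = [norm_cdf(y, 0.0, 2.0) for y in ys]

def pit_histogram(pit, bins=10):
    # Relative frequency of PIT values per equal-width bin on [0, 1]
    counts = [0] * bins
    for u in pit:
        counts[min(int(u * bins), bins - 1)] += 1
    return [c / len(pit) for c in counts]

hist_good = pit_histogram(pit_good)  # near-flat: ~0.1 per bin
hist_over = pit_histogram(pit_over)  # hump-shaped: mass piles up near 0.5
```

Such histograms (flat, hump-shaped, U-shaped, skewed, multi-modal) are exactly the inputs the proposed mixture density network interpreter is trained to map back to a probable observation-generating distribution.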
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors' identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 6207