Calibrating the Calibration Tester: Optimal Binning and Minimax Calibration Testing for Continuous Predictive Models

Published: 13 Apr 2026 · Last Modified: 13 Apr 2026 · Calibration for Modern AI @ AISTATS 2026 · CC BY 4.0
Keywords: Calibration testing, Probability Integral Transform (PIT), Minimax optimality, Uniformity testing, Histogram binning
TL;DR: We derive the maximum number of PIT histogram bins that balances discretization bias against statistical noise in calibration testing, and provide a minimax-optimal test.
Abstract: Evaluating the calibration of continuous predictive models relies heavily on the binned Probability Integral Transform (PIT). Practitioners routinely use arbitrary bin counts ($N$) and standard $\chi^2$ goodness-of-fit tests, which lack theoretical guarantees against worst-case calibration errors, especially in the large-$N$ regime or when the sample size varies. In this work, we translate recent advances in minimax uniformity testing to the machine learning calibration setting. By mapping the regression Expected Calibration Error ($\mathrm{ECE}$) to the $\ell_p$ distance from uniformity, we provide two rigorous tools for practitioners: (1) a formula for the maximum allowable number of bins $N_{\text{max}}$ that guarantees detection of a target $\mathrm{ECE}$, and (2) a test that achieves the minimax optimal rate. Finally, we discuss the trade-off between discretization bias and statistical noise, and show how the formula for $N_{\text{max}}$ provides a principled way to choose $N$ that balances these two effects.
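The binned-PIT pipeline the abstract describes can be sketched as follows. This is a minimal illustration using a Gaussian predictive model and SciPy's standard $\chi^2$ goodness-of-fit test, not the paper's minimax-optimal test; the bin count `N` below is an arbitrary illustrative choice, and the paper's $N_{\text{max}}$ formula (not reproduced here) is what would make this choice principled.

```python
# Hedged sketch: binned-PIT calibration check for a Gaussian predictive model.
# A calibrated model's PIT values F_model(y) are Uniform(0, 1), so deviations
# of the binned PIT histogram from uniformity indicate miscalibration.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Simulate data: targets y drawn around per-example means mu with unit noise.
n = 5000
mu = rng.normal(size=n)
y = mu + rng.normal(size=n)      # true observation noise has sd = 1

# The model predicts N(mu, sigma_pred). sigma_pred = 1.0 is calibrated here;
# try e.g. 0.7 (overconfident) to see the test reject uniformity.
sigma_pred = 1.0

# PIT values: evaluate the model's predictive CDF at the realized targets.
pit = stats.norm.cdf(y, loc=mu, scale=sigma_pred)

# Binned PIT + chi-square goodness-of-fit test against the uniform histogram.
# N = 10 is an arbitrary illustrative bin count, NOT the paper's N_max.
N = 10
counts, _ = np.histogram(pit, bins=N, range=(0.0, 1.0))
chi2_stat, p_value = stats.chisquare(counts)   # expected counts uniform by default
print(f"chi2 = {chi2_stat:.2f}, p = {p_value:.3f}")
```

The discretization-bias vs. statistical-noise trade-off shows up directly in this sketch: a small `N` smooths over miscalibration within each bin (bias), while a large `N` leaves few samples per bin and inflates the variance of the $\chi^2$ statistic (noise).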
Submission Number: 31