Keywords: calibration error, uncertainty estimation, statistical bias
Abstract: Building reliable machine learning systems requires that we correctly understand their level of confidence. Calibration measures how well a model's confidence matches its accuracy, and most research in this area targets $\mathrm{ECE}_\mathrm{BIN}$, the standard empirical estimate of calibration error. Using simulation, we show that $\mathrm{ECE}_\mathrm{BIN}$ can systematically underestimate or overestimate the true calibration error depending on the nature of the model's miscalibration, the size of the evaluation data set, and the number of bins. Critically, $\mathrm{ECE}_\mathrm{BIN}$ is most strongly biased for perfectly calibrated models. We propose a simple alternative calibration error metric, $\mathrm{ECE}_\mathrm{SWEEP}$, in which the number of bins is chosen to be as large as possible while preserving monotonicity in the calibration function. Evaluating our measure on distributions fit to neural network confidence scores on CIFAR-10, CIFAR-100, and ImageNet, we show that $\mathrm{ECE}_\mathrm{SWEEP}$ yields a less biased estimate of calibration error and should therefore be preferred by researchers evaluating the calibration of models trained on similar datasets.
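The binning scheme described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' released implementation: `ece_bin` computes a standard equal-mass binned calibration error, and `ece_sweep` increases the bin count for as long as the per-bin accuracies remain monotonically non-decreasing, using the largest such count. The function names and the choice of equal-mass (rather than equal-width) binning are assumptions made for this sketch.

```python
import numpy as np

def ece_bin(confs, labels, n_bins):
    """Equal-mass binned ECE: weighted average of |accuracy - confidence| per bin.

    confs: array of predicted confidences in [0, 1]; labels: array of 0/1 correctness.
    """
    order = np.argsort(confs)
    confs, labels = confs[order], labels[order]
    n = len(confs)
    ece = 0.0
    for idx in np.array_split(np.arange(n), n_bins):
        if len(idx) == 0:
            continue
        acc = labels[idx].mean()    # empirical accuracy in this bin
        conf = confs[idx].mean()    # mean confidence in this bin
        ece += (len(idx) / n) * abs(acc - conf)
    return ece

def ece_sweep(confs, labels):
    """Sweep the bin count upward while per-bin accuracies stay monotone.

    Returns the binned ECE at the largest bin count that preserves
    monotonicity of the empirical calibration function (a simplifying
    reading of the metric described in the abstract).
    """
    order = np.argsort(confs)
    c, y = confs[order], labels[order]
    best = ece_bin(c, y, 1)
    for b in range(2, len(c) + 1):
        accs = [y[idx].mean()
                for idx in np.array_split(np.arange(len(c)), b) if len(idx)]
        if np.all(np.diff(accs) >= 0):   # accuracies still monotone: accept b
            best = ece_bin(c, y, b)
        else:                            # monotonicity broken: stop the sweep
            break
    return best
```

A quick usage example: draw confidences uniformly and sample labels so the model is perfectly calibrated, then compare the two estimators; per the abstract's claim, `ece_bin` is expected to report a nonzero value even though the true calibration error is zero.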
One-sentence Summary: We highlight estimation bias in standard calibration error metrics and propose a less biased metric based on monotonic binning.
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Community Implementations: [1 code implementation (CatalyzeX)](https://www.catalyzex.com/paper/arxiv:2012.08668/code)
Reviewed Version (pdf): https://openreview.net/references/pdf?id=0Klv_cNxhZ