Keywords: Brier score decomposition, calibration, isotonic regression, predictive performance, probabilistic multi-class classification
TL;DR: The paper presents a case study that evaluates class-wise recalibration approaches based on histogram binning and isotonic regression using a decomposition of the Brier score.
Abstract: Decompositions of proper scores into measures of miscalibration (reliability), discrimination (resolution), and uncertainty have a long history in weather forecasting. In machine learning (ML), related calibration error metrics are now seeing a surge of interest. In this note, I review the close connection between these concepts and present a small case study on image classifiers from the literature. The study exemplifies that an exclusive focus on calibration error may lead to questionable conclusions when improvements in calibration come at the expense of a drastic decline in overall predictive performance. I critically examine histogram binning and show that isotonic regression produces better overall recalibration results. A simple linear interpolation of the isotonic fit is shown to further improve predictive performance without loss of calibration.
Submission Number: 23
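The decomposition mentioned in the abstract can be illustrated with a minimal sketch. It assumes the binary (one-vs-rest) case and groups forecasts by their distinct values, in which case the identity Brier score = reliability - resolution + uncertainty holds exactly. The function name `brier_decomposition` and the toy data are illustrative assumptions, not taken from the paper, and the paper's own multi-class, class-wise evaluation may differ in its details.

```python
from collections import defaultdict


def brier_decomposition(probs, outcomes):
    """Decompose the binary Brier score into reliability, resolution, uncertainty.

    Forecasts are grouped by their distinct values (an assumption of this
    sketch), so that brier == reliability - resolution + uncertainty exactly.
    """
    n = len(probs)
    base_rate = sum(outcomes) / n

    # Collect the observed outcomes for each distinct forecast value.
    groups = defaultdict(list)
    for p, y in zip(probs, outcomes):
        groups[p].append(y)

    reliability = 0.0  # miscalibration: forecast vs. observed frequency
    resolution = 0.0   # discrimination: observed frequency vs. base rate
    for p, ys in groups.items():
        freq = sum(ys) / len(ys)
        reliability += len(ys) * (p - freq) ** 2
        resolution += len(ys) * (freq - base_rate) ** 2
    reliability /= n
    resolution /= n

    uncertainty = base_rate * (1.0 - base_rate)
    brier = sum((p - y) ** 2 for p, y in zip(probs, outcomes)) / n
    return brier, reliability, resolution, uncertainty


# Toy data (hypothetical): two forecast values, four cases each.
probs = [0.2, 0.2, 0.2, 0.2, 0.8, 0.8, 0.8, 0.8]
outcomes = [0, 0, 0, 1, 1, 1, 1, 0]
brier, rel, res, unc = brier_decomposition(probs, outcomes)
print(brier, rel, res, unc)
```

Because resolution enters with a negative sign, a recalibration method can reduce the reliability term and still worsen the Brier score if it also reduces resolution, which is the trade-off the abstract warns about.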