Keywords: Model Calibration · Uncertainty Quantification · Multi-Rater Modelling.
Abstract: Calibration, the property of producing predicted probabilities that reflect the true likelihoods of outcomes, is a relevant attribute of medical image computing models and a key requirement in clinical decision-making. However, empirical Calibration Error (CE) estimates suffer from instability in data-scarce scenarios. Here we propose a Multi-Rater version (MR-CE) of any existing CE: a wrapper over conventional calibration metrics that provides a new strategy for estimating a CE, effectively addressing this limitation when multiple annotations per sample are available. MR-CEs offer more consistent estimates of calibration error by leveraging the consensus and disagreement among multiple annotators to generate virtually extended test datasets that are more robust to typical binning artifacts. We evaluate an MR version of the popular Expected Calibration Error (ECE), as well as of the more recent Kernel Density Estimation ECE (kdeECE), on a comprehensive set of classification and segmentation problems, demonstrating improved stability compared to their single-rater counterparts. Specifically, we show that MR-CEs achieve reduced variability as the test set size decreases, across all analysed datasets. Our findings emphasize the critical role of modelling inter-rater variability not only for training but also for evaluating medical image analysis models, in particular when studying the calibration of modern neural networks.
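The abstract does not spell out the exact MR-CE construction, so the sketch below illustrates one plausible reading under stated assumptions: each test sample is replicated once per rater annotation, producing the virtually extended test set on which a base CE estimator (here, a simple binned ECE for binary predictions) is applied. The function names `ece` and `multi_rater_ce` are hypothetical, not the authors' API.

```python
import numpy as np

def ece(confidences, labels, n_bins=10):
    """Standard binned Expected Calibration Error for binary predictions.

    confidences: predicted probabilities of the positive class, shape (N,)
    labels: binary ground-truth labels, shape (N,)
    """
    confidences = np.asarray(confidences, dtype=float)
    labels = np.asarray(labels, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    # Assign each prediction to one of n_bins equal-width bins.
    bin_idx = np.clip(np.digitize(confidences, edges[1:-1]), 0, n_bins - 1)
    total = len(confidences)
    err = 0.0
    for b in range(n_bins):
        mask = bin_idx == b
        if not mask.any():
            continue
        avg_conf = confidences[mask].mean()
        avg_freq = labels[mask].mean()  # empirical frequency of the positive class
        err += (mask.sum() / total) * abs(avg_conf - avg_freq)
    return err

def multi_rater_ce(ce_fn, confidences, rater_labels, **kwargs):
    """Hypothetical MR wrapper: replicate each prediction once per rater
    annotation (the virtually extended test set), then apply the base CE."""
    ext_conf, ext_labels = [], []
    for conf, sample_labels in zip(confidences, rater_labels):
        for lab in sample_labels:
            ext_conf.append(conf)
            ext_labels.append(lab)
    return ce_fn(np.asarray(ext_conf), np.asarray(ext_labels), **kwargs)

# Toy usage: 3 samples, each annotated by 3 raters.
conf = [0.9, 0.4, 0.7]
raters = [[1, 1, 0], [0, 0, 1], [1, 0, 1]]
print(multi_rater_ce(ece, conf, raters, n_bins=5))
```

Under this reading, rater disagreement directly populates the bins with soft empirical frequencies, which is one way an extended test set could reduce estimator variance on small test sets; the paper's actual construction may differ.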
Submission Number: 14