Keywords: Model Calibration · Uncertainty Quantification · Multi-Rater Modelling.
Abstract: Calibration, the property of producing predicted probabilities that reflect the true likelihoods of outcomes, is a relevant attribute of medical image computing models and a key requirement in clinical decision-making. However, empirical Calibration Error (CE) estimates suffer from instability in data-scarce scenarios. Here we propose a Multi-Rater version (MR-CE) of any existing CE: a wrapper over conventional calibration metrics that provides a new strategy for estimating a CE, effectively addressing this limitation when multiple annotations per sample are available. MR-CEs offer more consistent estimates of calibration error by leveraging the consensus and disagreement among multiple annotators to generate virtually extended test datasets that are more robust to typical binning artifacts. We evaluate an MR version of the popular Expected Calibration Error (ECE), as well as of the more recent Kernel Density Estimation ECE (kdeECE), on a comprehensive set of classification and segmentation problems, demonstrating improved stability compared to their single-rater counterparts. Specifically, we show that MR-CEs achieve reduced variability as the test set size decreases, across all analysed datasets. Our findings emphasize the critical role of modelling inter-rater variability not only for training but also for evaluating medical image analysis models, in particular when studying the calibration of modern neural networks.
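The abstract does not spell out the exact MR-CE construction, so the sketch below illustrates one plausible reading under stated assumptions: each test sample is replicated once per rater annotation, producing the virtually extended test set on which a base CE estimator (here, a simple binned ECE for binary predictions) is applied. The function names `ece` and `multi_rater_ce` are hypothetical, not the authors' API.

```python
import numpy as np

def ece(confidences, labels, n_bins=10):
    """Standard binned Expected Calibration Error for binary predictions.

    confidences: predicted probabilities of the positive class, shape (N,)
    labels: binary ground-truth labels, shape (N,)
    """
    confidences = np.asarray(confidences, dtype=float)
    labels = np.asarray(labels, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    # Assign each prediction to one of n_bins equal-width bins.
    bin_idx = np.clip(np.digitize(confidences, edges[1:-1]), 0, n_bins - 1)
    total = len(confidences)
    err = 0.0
    for b in range(n_bins):
        mask = bin_idx == b
        if not mask.any():
            continue
        avg_conf = confidences[mask].mean()
        avg_freq = labels[mask].mean()  # empirical frequency of the positive class
        err += (mask.sum() / total) * abs(avg_conf - avg_freq)
    return err

def multi_rater_ce(ce_fn, confidences, rater_labels, **kwargs):
    """Hypothetical MR wrapper: replicate each prediction once per rater
    annotation (the virtually extended test set), then apply the base CE."""
    ext_conf, ext_labels = [], []
    for conf, sample_labels in zip(confidences, rater_labels):
        for lab in sample_labels:
            ext_conf.append(conf)
            ext_labels.append(lab)
    return ce_fn(np.asarray(ext_conf), np.asarray(ext_labels), **kwargs)

# Toy usage: 3 samples, each annotated by 3 raters.
conf = [0.9, 0.4, 0.7]
raters = [[1, 1, 0], [0, 0, 1], [1, 0, 1]]
print(multi_rater_ce(ece, conf, raters, n_bins=5))
```

Under this reading, rater disagreement directly populates the bins with soft empirical frequencies, which is one way an extended test set could reduce estimator variance on small test sets; the paper's actual construction may differ.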
Submission Number: 14