Some hidden traps of confidence intervals in medical image segmentation: coverage issues

Pascaline André; Charles Heitz; Evangelia Christodoulou; Annika Reinke; Carole H. Sudre; Michela Antonelli; M. Jorge Cardoso; Antoine Gilson; Sophie Tezenas du Montcel; Gaël Varoquaux; Lena Maier-Hein; Olivier Colliot

Published: 25 Jul 2025, Last Modified: 25 Jul 2025. Venue: BRIDGE 2025 Poster. License: CC BY 4.0

Keywords: Medical imaging, Validation, Confidence intervals, Segmentation

TL;DR: We studied the coverage properties of the most common confidence interval methods for assessing uncertainty in the performance of segmentation models and unveiled important pitfalls.

Abstract: Medical imaging AI models are usually assessed by reporting an empirical summary statistic of the performance metric, most commonly the mean or median. Recent work has shown that most studies overlook the uncertainty of these estimates, potentially leading to misleading conclusions and hampering clinical translation of medical imaging AI models. To address this issue, systematic reporting of confidence intervals (CIs) has been recommended, but many different CI methods exist, and there is very little literature on their behavior in medical imaging. A fundamental property of a CI method is its coverage, that is, how often the interval contains the true value of the summary statistic. This paper contributes towards filling this literature gap in the context of medical image segmentation, studying the coverage of five CI methods for arguably the two most common summary statistics, the mean and the median. To this end, we perform a large-scale analysis of CI coverage using non-parametric simulations based on benchmark instances representing diverse real-world distributions of two common segmentation metrics (Dice similarity coefficient and normalized surface distance). For the mean, all CI methods have decent coverage for most instances when sample sizes exceed 50, although there are exceptions. For CIs of the median, we unveil major pitfalls: two common bootstrap CI methods behave catastrophically on average, whereas another fails only on very degenerate distributions. We believe these pitfalls are important to communicate to the community and that these findings will contribute to future efforts to provide standardized guidelines on confidence interval reporting in medical imaging AI.
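To make the notion of coverage concrete, the following is a minimal sketch of a non-parametric coverage simulation, not the authors' implementation. It assumes that an observed set of per-case metric values is treated as the population, that test sets of size n are drawn from it with replacement, and that a percentile bootstrap CI is computed on each draw; the abstract does not name the five CI methods studied, so the percentile bootstrap is only an illustrative choice. The function names, the default parameters, and the Beta-distributed stand-in for Dice scores are all hypothetical.

```python
import numpy as np


def percentile_bootstrap_ci(sample, statistic=np.mean, n_boot=500, alpha=0.05, rng=None):
    """Percentile bootstrap CI for `statistic` (an illustrative choice of CI method)."""
    rng = np.random.default_rng() if rng is None else rng
    sample = np.asarray(sample, dtype=float)
    n = len(sample)
    # Resample indices with replacement: one row per bootstrap replicate.
    idx = rng.integers(0, n, size=(n_boot, n))
    boot_stats = statistic(sample[idx], axis=1)
    lo, hi = np.quantile(boot_stats, [alpha / 2, 1 - alpha / 2])
    return lo, hi


def estimate_coverage(population, statistic=np.mean, n=50, n_sim=200,
                      n_boot=500, alpha=0.05, seed=0):
    """Fraction of simulated test sets whose CI contains the population value."""
    rng = np.random.default_rng(seed)
    population = np.asarray(population, dtype=float)
    true_value = statistic(population)  # "ground truth" of the simulation
    hits = 0
    for _ in range(n_sim):
        # Non-parametric simulation: draw a test set from the empirical distribution.
        sample = rng.choice(population, size=n, replace=True)
        lo, hi = percentile_bootstrap_ci(sample, statistic, n_boot, alpha, rng)
        hits += int(lo <= true_value <= hi)
    return hits / n_sim


# Hypothetical stand-in for per-case Dice scores (synthetic, not real benchmark data).
population = np.random.default_rng(42).beta(5, 2, size=2000)

coverage_mean = estimate_coverage(population, statistic=np.mean)
coverage_median = estimate_coverage(population, statistic=np.median)
print(f"mean: {coverage_mean:.3f}, median: {coverage_median:.3f}")
```

A method with good coverage should return a value close to the nominal level, 1 - alpha (0.95 here). Repeating the estimate over different sample sizes, statistics, and metric distributions is the kind of comparison the abstract describes, though the values from this synthetic example say nothing about the paper's results.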

Submission Number: 3
