Uncertainty quantification in clinical settings: A retinal fundus screening study and benchmarking

18 Sept 2025 (modified: 11 Feb 2026) · Submitted to ICLR 2026 · CC BY 4.0
Keywords: Responsible AI, Trustworthy AI, Uncertainty quantification, Computer-aided diagnosis, Benchmarking, Clinical validation, Retinal fundus imaging
Abstract: We present the most extensive benchmark to date for uncertainty quantification (UQ) in retinal AI screening, providing practical guidance for clinical evaluators and regulators and highlighting the importance of risk–coverage–accuracy analysis. We systematically assess six well-known post-hoc UQ techniques across three major diseases: glaucoma (115K+ images), age-related macular degeneration (29K+ images), and diabetic retinopathy (105K+ images). Our benchmark comprises three Vision Transformer variants, standardized train/test/calibration splits, and evaluation on both public datasets and in-house clinical data from a local hospital. Results show that screening models can be miscalibrated and overconfident, and although UQ helps, its benefits are highly method- and disease-dependent. Our risk–coverage–accuracy analysis shows that coverage drops sharply as risk limits are tightened, and no single approach is consistently dependable across all contexts. Although neither consistently outperforms the other, Deep Ensembles and Test-Time Augmentation (TTA) are the two practical UQ approaches that most frequently improve selective prediction and/or calibration. Conformal Prediction (CP) serves as an essential safety rail, ensuring alignment between nominal and observed coverage. However, no method reliably achieves the 2% target risk required for autonomous screening without sacrificing coverage. These findings highlight the need for more robust post-hoc UQ methods, both in-distribution and under domain shift (out-of-distribution), as well as better mechanisms for capturing disagreement and for policy-aware thresholding in human-in-the-loop workflows. To facilitate progress, we release our benchmark, including standardized data splits, trained model checkpoints, code, and an online demo for interactive exploration, providing a reference for future UQ research in ophthalmic AI screening.
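
To make the risk–coverage–accuracy analysis concrete, the minimal Python sketch below (not the paper's released code; the function names, synthetic data, and 2% target are illustrative assumptions) shows how selective-prediction coverage at a given target risk can be computed: predictions are sorted by confidence, the error rate among retained samples is measured at each coverage level, and the largest coverage whose risk stays below the target is reported.

    # Illustrative sketch of a risk-coverage-accuracy analysis for selective prediction.
    # Assumed inputs: a per-sample confidence score and a 0/1 correctness indicator.
    import numpy as np

    def risk_coverage_curve(confidence: np.ndarray, correct: np.ndarray):
        """Return (coverage, risk) arrays for selective prediction.

        confidence: higher = more confident (e.g., max softmax probability).
        correct:    1 if the model's prediction is correct, else 0.
        """
        order = np.argsort(-confidence)                 # most confident first
        errors = 1 - correct[order]
        n = len(correct)
        coverage = np.arange(1, n + 1) / n              # fraction of samples retained
        risk = np.cumsum(errors) / np.arange(1, n + 1)  # error rate among retained samples
        return coverage, risk

    def max_coverage_at_risk(coverage, risk, target_risk=0.02):
        """Largest coverage whose selective risk does not exceed target_risk."""
        ok = risk <= target_risk
        return coverage[ok].max() if ok.any() else 0.0

    if __name__ == "__main__":
        # Synthetic example only: confidences loosely correlated with correctness.
        rng = np.random.default_rng(0)
        conf = rng.uniform(size=1000)
        corr = (rng.uniform(size=1000) < 0.5 + 0.5 * conf).astype(int)
        cov, rsk = risk_coverage_curve(conf, corr)
        print("max coverage at 2% risk:", max_coverage_at_risk(cov, rsk, 0.02))

In this setup, a low achievable coverage at the 2% target risk is exactly the failure mode the abstract describes: the screening model can only meet the autonomous-operation risk bar by deferring a large fraction of cases to human review.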
Primary Area: datasets and benchmarks
Submission Number: 14201