BEYOND AUROC: EVALUATING TEMPORAL STABILITY, FALSE-POSITIVE LOAD, AND UNCERTAINTY CALIBRATION IN CAPSULE ENDOSCOPY VIDEO AI

Published: 12 Mar 2026, Last Modified: 12 Mar 2026, videoai.isbi2026 Oral, Everyone, CC BY 4.0
Keywords: Capsule Endoscopy, Medical Video Analysis, Temporal Stability, False-Positive Analysis, Uncertainty Quantification, Reliability Evaluation
TL;DR: Near-ceiling AUROC can hide failures in medical video AI. We propose a video-aware framework that measures temporal stability, false positives, and uncertainty beyond frame-level metrics.
Abstract: Frame-level performance metrics (e.g., AUROC, accuracy) can substantially overestimate real-world reliability for medical video AI because they ignore temporal consistency and reviewer burden. Using the Kvasir Capsule dataset, we report a complementary evaluation suite capturing: (i) temporal stability on lesion segments, (ii) false-positive workload on normal segments, and (iii) uncertainty calibration for risk-aware deployment. Although the classifier achieves near-ceiling frame-level discrimination (AUROC = 0.999, AUPRC = 0.998, F1 = 0.978), our temporal analysis reveals clinically relevant failure modes: severe prediction flicker on some lesion segments (worst TJI = 322.6 flips/1000 frames; worst TDP = 0.19) and concentrated false-positive burden on specific normal segments (worst VRBI = 65.6). Uncertainty analysis indicates well-calibrated confidence (ECE = 0.007) and meaningful uncertainty-error coupling (UEC = 0.42), supporting human-in-the-loop safety use. Our results show that near-ceiling frame-level performance can coexist with temporally localized instability and workload concentration that remain hidden under conventional evaluation metrics.
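As a rough illustration of two quantities named in the abstract, the sketch below computes a flip rate in flips per 1000 frames and a standard binned expected calibration error (ECE). The abstract does not give the exact definitions of TJI or the ECE binning, so the function names, the 0.5 decision threshold, the normalization by segment length, and the 10 equal-width bins are all assumptions made for illustration, not the authors' implementation.

```python
import numpy as np


def flips_per_1000_frames(probs, threshold=0.5):
    """Count label changes between consecutive frames of one segment,
    scaled to 1000 frames. Threshold and normalization are assumptions."""
    labels = (np.asarray(probs, dtype=float) >= threshold).astype(int)
    if labels.size < 2:
        return 0.0
    # A flip is any frame whose binarized label differs from the previous one.
    flips = int(np.count_nonzero(np.diff(labels)))
    return 1000.0 * flips / labels.size


def expected_calibration_error(confidences, correct, n_bins=10):
    """Standard binned ECE: weighted mean of |accuracy - confidence| per bin.
    Equal-width bins over [0, 1] are an assumption."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    # Interior edges only, right-inclusive bins; indices fall in 0..n_bins-1.
    bin_ids = np.digitize(confidences, edges[1:-1], right=True)
    ece = 0.0
    for b in range(n_bins):
        mask = bin_ids == b
        if mask.any():
            gap = abs(correct[mask].mean() - confidences[mask].mean())
            ece += mask.mean() * gap  # mask.mean() is the bin's share of samples
    return float(ece)
```

Computing the flip rate per segment rather than over the whole dataset is what lets a worst-case segment stand out even when pooled frame-level metrics are near ceiling.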
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 6