BEYOND AUROC: EVALUATING TEMPORAL STABILITY, FALSE-POSITIVE LOAD, AND UNCERTAINTY CALIBRATION IN CAPSULE ENDOSCOPY VIDEO AI
Keywords: Capsule Endoscopy, Medical Video Analysis, Temporal Stability, False-Positive Analysis, Uncertainty Quantification, Reliability Evaluation
TL;DR: Near-ceiling AUROC can hide failures in medical video AI. We propose a video-aware framework that measures temporal stability, false positives, and uncertainty beyond frame-level metrics.
Abstract: Frame-level performance metrics (e.g., AUROC, accuracy) can substantially overestimate real-world reliability for medical video AI because they ignore temporal consistency and reviewer burden. Using the Kvasir Capsule dataset, we report a complementary evaluation suite capturing: (i) temporal stability on lesion segments, (ii) false-positive workload on normal segments, and (iii) uncertainty calibration for risk-aware deployment. Although the classifier achieves near-ceiling frame-level discrimination (AUROC = 0.999, AUPRC = 0.998, F1 = 0.978), our temporal analysis reveals clinically relevant failure modes: severe prediction flicker on some lesion segments (worst TJI = 322.6 flips/1000 frames; worst TDP = 0.19) and concentrated false-positive burden on specific normal segments (worst VRBI = 65.6). Uncertainty analysis indicates well-calibrated confidence (ECE = 0.007) and meaningful uncertainty-error coupling (UEC = 0.42), supporting human-in-the-loop safety use. Our results show that near-ceiling frame-level performance can coexist with temporally localized instability and workload concentration that remain hidden under conventional evaluation metrics.
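As an illustration of two of the quantities reported above, the following is a minimal sketch, not the authors' implementation. It assumes that a flicker score in flips per 1000 frames counts changes between consecutive per-frame predicted labels within one segment and normalizes by the segment's frame count, and that ECE is the standard equal-width-bin estimate. The function names, the normalization choice, and the default of 10 bins are all assumptions made for illustration; the abstract does not define TDP, VRBI, or UEC, so they are not sketched here.

```python
import numpy as np


def flips_per_1000_frames(pred_labels):
    """Label changes between consecutive frames, scaled to 1000 frames.

    Assumption: normalization by the number of frames in the segment;
    the abstract does not state the exact TJI definition.
    """
    labels = np.asarray(pred_labels)
    if labels.size < 2:
        return 0.0
    flips = np.count_nonzero(labels[1:] != labels[:-1])
    return 1000.0 * flips / labels.size


def expected_calibration_error(confidences, correct, n_bins=10):
    """Standard ECE with equal-width confidence bins on [0, 1].

    confidences: predicted confidence of the chosen class per frame.
    correct: 1 if the prediction was right, 0 otherwise.
    """
    conf = np.asarray(confidences, dtype=float)
    corr = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    # Map each confidence to a bin index in [0, n_bins - 1].
    bin_ids = np.clip(np.digitize(conf, edges[1:-1], right=True), 0, n_bins - 1)
    ece = 0.0
    for b in range(n_bins):
        mask = bin_ids == b
        if mask.any():
            # Weight each bin's |accuracy - confidence| gap by its share of frames.
            ece += mask.mean() * abs(corr[mask].mean() - conf[mask].mean())
    return float(ece)


if __name__ == "__main__":
    # Hypothetical per-frame predictions for one lesion segment.
    segment_preds = [1, 1, 0, 1, 1, 1, 0, 0, 1, 1]
    print(flips_per_1000_frames(segment_preds))

    # Hypothetical confidences and correctness flags.
    conf = [0.95, 0.9, 0.6, 0.85, 0.99]
    hit = [1, 1, 0, 1, 1]
    print(expected_calibration_error(conf, hit))
```

Computing the flip score per segment rather than over the pooled frames is what exposes the localized instability described above: a handful of unstable segments can have very high flip scores while contributing little to a pooled frame-level metric.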
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 6