Beyond Dice: Risk-Normalized and Hazard-Aware Evaluation of Medical Segmentation for Image-Guided Robotics
Keywords: robotics, safety
Abstract: Safety in embodied systems depends on where perception fails, not only on how often it fails. We introduce two bounded evaluation metrics for segmentation that make spatial risk explicit using an anatomy-derived hazard field built from distance to protected structures. The \emph{Safety Impact Score} (SIS) measures the share of total hazard mass that is misclassified, with a tunable trade-off between false negatives and false positives. The \emph{Safety Tail Risk} (STAR) summarizes the worst fraction of error hazards using a conditional value-at-risk operator. To isolate metric behavior from model quality, we design a model-free matched-Dice stress test that relocates equal numbers of boundary errors toward or away from hazard while keeping Dice unchanged. We run this protocol on three public Medical Segmentation Decathlon tasks (Hepatic Vessel, Liver, Pancreas; five cases each). Across datasets, STAR shows large positive deltas for the risky variant (combined mean $\Delta$STAR $=0.431 \pm 0.124$, paired Wilcoxon $p=3.24{\times}10^{-4}$), and the SIS delta is also positive (combined mean $\Delta$SIS $=0.0818 \pm 0.0771$, $p=3.05{\times}10^{-5}$). Effects are strongest in Hepatic Vessel, where the hazard corresponds to vessels (mean $\Delta$STAR $=0.614 \pm 0.307$, mean $\Delta$SIS $=0.224 \pm 0.182$). A proximity-weighted overlap baseline (hazard-weighted Dice) changes little or moves in the opposite direction. Results persist with an exponential hazard kernel, indicating robustness to the choice of kernel. These findings demonstrate that risk-normalized and tail-aware evaluation captures safety-relevant differences that overlap metrics miss, using only public data and a simple perturbation protocol.
Submission Number: 13
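
The abstract describes SIS and STAR only in words, so the following is a minimal Python sketch of one way they could be computed, not the authors' implementation. Everything specific here is an assumption: the function names, the exponential kernel with length scale `scale`, the weighting `beta` between false negatives and false positives, normalizing SIS by the hazard mass of the whole image, and taking STAR as the mean of the top `alpha` fraction of hazard values at error voxels. The toy example at the end mimics the matched-Dice idea: two predictions with the same number of false positives, placed near or far from a protected structure.

```python
import numpy as np
from scipy.ndimage import distance_transform_edt


def hazard_field(protected, scale=2.0):
    """Hazard in (0, 1]: 1 on the protected structure, decaying with distance.

    Assumption: exponential kernel exp(-d / scale); `scale` is hypothetical.
    """
    # distance_transform_edt gives each nonzero voxel its distance to the
    # nearest zero voxel, so invert the mask to measure distance TO the structure.
    dist = distance_transform_edt(~protected.astype(bool))
    return np.exp(-dist / scale)


def sis(pred, gt, hazard, beta=0.5):
    """Assumed SIS: weighted hazard mass of errors over total hazard mass."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    fn = gt & ~pred  # missed target voxels
    fp = pred & ~gt  # spurious target voxels
    total = hazard.sum()
    if total == 0:
        return 0.0
    # With beta in [0, 1] and hazard >= 0 the score stays in [0, 1].
    return float((beta * hazard[fn].sum() + (1 - beta) * hazard[fp].sum()) / total)


def star(pred, gt, hazard, alpha=0.1):
    """Assumed STAR: mean of the worst alpha fraction of error-voxel hazards."""
    err = pred.astype(bool) ^ gt.astype(bool)
    h = np.sort(hazard[err])[::-1]  # error hazards, largest first
    if h.size == 0:
        return 0.0
    k = max(1, int(np.ceil(alpha * h.size)))
    return float(h[:k].mean())


def dice(pred, gt):
    """Standard Dice overlap between two binary masks."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    return 2.0 * (pred & gt).sum() / (pred.sum() + gt.sum())


# Toy 2D case: an 8x8 target and a protected structure along column 13.
gt = np.zeros((16, 16), dtype=bool)
gt[4:12, 4:12] = True
protected = np.zeros((16, 16), dtype=bool)
protected[:, 13] = True
hazard = hazard_field(protected)

# Same number of false positives, relocated toward or away from the hazard.
risky = gt.copy()
risky[4:12, 12] = True  # errors one voxel from the protected structure
safe = gt.copy()
safe[4:12, 3] = True    # errors ten voxels from the protected structure

print("Dice  risky/safe:", dice(risky, gt), dice(safe, gt))
print("SIS   risky/safe:", sis(risky, gt, hazard), sis(safe, gt, hazard))
print("STAR  risky/safe:", star(risky, gt, hazard), star(safe, gt, hazard))
```

In this toy setting both predictions have identical Dice by construction, while the two hazard-aware scores are higher for the variant whose errors sit next to the protected structure. The magnitudes depend entirely on the assumed kernel and normalization and are not comparable to the reported results.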