Keywords: Out-of-Distribution (OOD) Detection, Confidence Calibration, Expected Calibration Error (ECE), Distribution Shift / Dataset Shift
TL;DR: Standard empirical ECE masks catastrophic OOD errors. Bounding worst-case ECE reveals an architectural divide for safety: CNNs need full-distribution entropy scores, whereas fine-tuned ViTs require explicit boundary tracking.
Abstract: Reliable deployment of deep neural networks requires that predictive confidence reflect true correctness likelihood, yet models routinely produce silent failures: confident but incorrect predictions on out-of-distribution (OOD) data. Existing OOD detection scores lack probabilistic semantics, and the standard remedy of post-hoc calibration followed by empirical Expected Calibration Error (ECE) evaluation is insufficient, because average-based binning attenuates rare but critical high-confidence errors on OOD inputs. We derive upper bounds on $L_1$ and $L_2$ ECE parameterized by the OOD contamination ratio $\alpha$, providing a worst-case safety envelope for mixed deployment environments. The $L_2$ bound, grounded in the Brier score decomposition, explicitly penalizes high-magnitude confidence deviations that $L_1$ averaging obscures. Applying these bounds with out-of-fold calibration selection in a large-scale study of 20+ scoring functions, we show that methods ranked as optimal under empirical ECE are pruned under worst-case evaluation, exposing a pronounced architectural dichotomy: for convolutional networks trained from scratch, full-distribution entropy scores (Guessing Entropy, Predictive Entropy) yield the tightest safety guarantees, whereas for fine-tuned Vision Transformers, explicit boundary-distance methods (fDBD) dominate due to non-collapsed feature geometries inherited from pretraining.
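To make the abstract's central claim concrete, the following is a minimal sketch of the standard binned $L_1$ ECE estimator (not the paper's implementation; the bin count, data, and contamination setup are illustrative). It shows how a small number of confident-but-wrong OOD predictions barely move the averaged metric:

```python
# Illustrative sketch: equal-width binned L1 ECE. Not the paper's code;
# data and bin count are hypothetical, chosen to show the masking effect.
import numpy as np

def ece_l1(conf, correct, n_bins=10):
    """Empirical L1 ECE: weighted mean |confidence - accuracy| per bin."""
    conf = np.asarray(conf, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    n, ece = len(conf), 0.0
    for i, (lo, hi) in enumerate(zip(edges[:-1], edges[1:])):
        # First bin is closed on the left so conf == 0.0 is counted once.
        mask = (conf >= lo if i == 0 else conf > lo) & (conf <= hi)
        if mask.any():
            gap = abs(conf[mask].mean() - correct[mask].mean())
            ece += (mask.sum() / n) * gap
    return ece

# 990 roughly calibrated in-distribution predictions ...
rng = np.random.default_rng(0)
conf_id = rng.uniform(0.5, 1.0, 990)
correct_id = (rng.uniform(size=990) < conf_id).astype(float)
# ... contaminated with 10 silent failures: confident, always wrong (OOD).
conf_mix = np.concatenate([conf_id, np.full(10, 0.99)])
corr_mix = np.concatenate([correct_id, np.zeros(10)])
print(f"ECE clean: {ece_l1(conf_id, correct_id):.3f}, "
      f"ECE with 1% OOD failures: {ece_l1(conf_mix, corr_mix):.3f}")
```

Because each OOD failure contributes to a bin average weighted by bin mass, a 1% contamination rate shifts empirical ECE only marginally, which is precisely why the abstract argues for worst-case bounds parameterized by the contamination ratio $\alpha$.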
Submission Number: 28