Keywords: model calibration, post-hoc calibration, calibration benchmark
Abstract: Reliable uncertainty calibration is crucial for the safe deployment of deep neural networks in high-stakes settings. While these networks are known to exhibit systematic overconfidence, especially under distribution shifts, the calibration of large-scale vision models, such as ConvNeXt, EVA, and BEiT, has remained underexplored. We comprehensively examine their calibration behavior, uncovering evidence that challenges well-established assumptions. We find that these models are underconfident on in-distribution data, which results in increased calibration error, yet exhibit improved calibration under distribution shifts. This phenomenon is primarily driven by modern training techniques, including massive pretraining and sophisticated regularization and augmentation methods, rather than architectural innovations alone. We also demonstrate that these large-scale models are highly responsive to post-hoc calibration techniques in the in-distribution setting, enabling practitioners to mitigate underconfidence bias effectively. However, these methods become progressively less reliable under severe distribution shifts and can occasionally produce counterproductive effects. Our findings highlight the complex, non-monotonic effects of architectural and training innovations on calibration, challenging established narratives of continuous improvement.
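Note: The abstract refers to post-hoc calibration techniques and calibration error without naming specific methods. The sketch below illustrates one common instance of each, temperature scaling fitted on held-out logits and expected calibration error (ECE) with equal-width confidence bins; it is an illustrative assumption, not the authors' implementation, and all function names are hypothetical.

```python
# Minimal sketch of post-hoc calibration via temperature scaling, plus ECE.
# Assumes PyTorch logits/labels from a held-out in-distribution split.
import torch
import torch.nn.functional as F

def fit_temperature(logits: torch.Tensor, labels: torch.Tensor) -> float:
    """Fit a single temperature T on validation logits by minimizing the
    negative log-likelihood of softmax(logits / T)."""
    log_t = torch.zeros(1, requires_grad=True)  # optimize log T to keep T > 0
    optimizer = torch.optim.LBFGS([log_t], lr=0.1, max_iter=50)

    def closure():
        optimizer.zero_grad()
        loss = F.cross_entropy(logits / log_t.exp(), labels)
        loss.backward()
        return loss

    optimizer.step(closure)
    return log_t.exp().item()

def expected_calibration_error(probs: torch.Tensor, labels: torch.Tensor,
                               n_bins: int = 15) -> float:
    """Standard ECE: weighted average gap between confidence and accuracy per bin."""
    conf, preds = probs.max(dim=1)
    correct = preds.eq(labels).float()
    bins = torch.linspace(0, 1, n_bins + 1)
    ece = torch.zeros(1)
    for lo, hi in zip(bins[:-1], bins[1:]):
        mask = (conf > lo) & (conf <= hi)
        if mask.any():
            ece += mask.float().mean() * (correct[mask].mean() - conf[mask].mean()).abs()
    return ece.item()

# Usage (hypothetical tensors): fit T on held-out data, then rescale test logits.
# For an underconfident model, the fitted T is typically below 1, which sharpens
# the softmax and reduces in-distribution calibration error.
# T = fit_temperature(val_logits, val_labels)
# test_probs = F.softmax(test_logits / T, dim=1)
# ece = expected_calibration_error(test_probs, test_labels)
```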
Supplementary Material: zip
Primary Area: datasets and benchmarks
Submission Number: 18027