Keywords: model calibration, post-hoc calibration, calibration benchmark
Abstract: Reliable uncertainty calibration is crucial for safe deployment of deep neural networks in high-stakes settings. While these networks are known to exhibit systematic overconfidence, particularly under distribution shifts, the calibration of large-scale vision models, such as ConvNeXt, EVA, and BEiT, remains underexplored. We comprehensively examine their calibration behavior, uncovering findings that challenge well-established assumptions. We find that these models are underconfident on in-distribution data, resulting in increased calibration error, but exhibit improved calibration under distribution shifts. This phenomenon is primarily driven by modern training techniques, including massive pretraining and sophisticated regularization and augmentation methods, rather than architectural innovations alone. We also demonstrate that these large-scale models are highly responsive to post-hoc calibration techniques in the in-distribution setting, enabling practitioners to mitigate underconfidence bias effectively. However, these methods become progressively less reliable under severe distribution shifts and can occasionally produce counterproductive results. Our findings highlight the complex, non-monotonic effects of architectural and training innovations on calibration, challenging established narratives of continuous improvement.
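The abstract's claims revolve around two technical quantities: the calibration error used to measure miscalibration, and the post-hoc calibration techniques applied to correct it. The following is a minimal illustrative sketch (not the paper's code; all function names and the synthetic data are assumptions) of the standard versions of both: expected calibration error (ECE) on top-1 confidences and temperature scaling fit by minimizing negative log-likelihood on a held-out split.

```python
# Minimal sketch (not the paper's implementation): expected calibration error
# (ECE) and temperature scaling, the standard post-hoc calibration baseline.
# Function names and the synthetic example below are illustrative assumptions.
import numpy as np
from scipy.optimize import minimize_scalar


def softmax(logits: np.ndarray) -> np.ndarray:
    """Row-wise softmax with max-subtraction for numerical stability."""
    z = logits - logits.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)


def expected_calibration_error(probs: np.ndarray, labels: np.ndarray, n_bins: int = 15) -> float:
    """Equal-width-bin ECE on top-1 confidences: weighted mean |accuracy - confidence| gap."""
    confidences = probs.max(axis=1)
    predictions = probs.argmax(axis=1)
    accuracies = (predictions == labels).astype(float)

    bins = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        in_bin = (confidences > lo) & (confidences <= hi)
        if in_bin.any():
            gap = abs(accuracies[in_bin].mean() - confidences[in_bin].mean())
            ece += in_bin.mean() * gap  # weight by fraction of samples in the bin
    return ece


def fit_temperature(val_logits: np.ndarray, val_labels: np.ndarray) -> float:
    """Fit a single temperature T > 0 by minimizing NLL on held-out validation logits."""
    def nll(t: float) -> float:
        log_probs = np.log(softmax(val_logits / t) + 1e-12)
        return -log_probs[np.arange(len(val_labels)), val_labels].mean()

    result = minimize_scalar(nll, bounds=(0.05, 10.0), method="bounded")
    return float(result.x)


if __name__ == "__main__":
    # Synthetic logits standing in for a classifier's outputs (purely illustrative).
    rng = np.random.default_rng(0)
    labels = rng.integers(0, 10, size=2000)
    logits = rng.normal(0.0, 1.0, size=(2000, 10))
    logits[np.arange(2000), labels] += 1.5  # modest margin on the correct class

    t = fit_temperature(logits[:1000], labels[:1000])  # calibrate on a validation split
    before = expected_calibration_error(softmax(logits[1000:]), labels[1000:])
    after = expected_calibration_error(softmax(logits[1000:] / t), labels[1000:])
    print(f"T = {t:.2f}, ECE before = {before:.3f}, after = {after:.3f}")
```

For an underconfident model of the kind the abstract describes, the fitted temperature is typically below 1, sharpening the predictive distribution rather than smoothing it as in the classical overconfident case.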
Supplementary Material: zip
Primary Area: datasets and benchmarks
Submission Number: 18027