Revisiting Generalization Measures Beyond IID: How Image Corruption and Perturbation Affect Robustness of Generalization Measures

TMLR Paper9838 Authors

18 Jun 2026 (modified: 20 Jun 2026). Under review for TMLR. License: CC BY 4.0
Abstract: Predicting generalization from quantities available before target-test evaluation remains a central challenge in deep learning. The systematic benchmark of Jiang et al. (2020) evaluated many generalization measures, but it focused on independent and identically distributed (IID) settings. We revisit this problem for image classifiers evaluated under controlled corruptions and perturbations. Our study uses CIFAR-10-C/P, where the label space and task remain fixed while the input images are degraded or perturbed. This setting also allows us to revisit the robustness concerns raised by Dziugaite et al. (2020), who showed that the apparent reliability of generalization measures can depend strongly on experimental conditions. Our experiments show that the usefulness of generalization measures is strongly regime-dependent. When the selector must commit to a measure family before target-test evaluation, Calibration & Confidence provides the most favorable family-level downside profile in our CIFAR-10-C/P protocol, achieving the lowest normalized-regret point estimate and the highest top-20% hit rate among non-oracle families. Optimization-based measures, Information Criteria, and Sharpness-based measures provide additional regime-dependent signals in correlation or local-reliability analyses. Together, these findings suggest that model selection should not rely only on measures favored by IID evaluation. Instead, generalization measures should be treated as regime-dependent ranking signals and validated on a target-like proxy for the expected corruption or perturbation setting.
Submission Type: Regular submission (no more than 12 pages of main content)
Assigned Action Editor: ~Adams_Wai-Kin_Kong1
Submission Number: 9838