On the usefulness of the fit-on-test view on evaluating calibration of classifiers

Published: 01 Jan 2025 · Last Modified: 15 May 2025 · Machine Learning, 2025 · License: CC BY-SA 4.0
Abstract: Calibrated uncertainty estimates are essential for classifiers used in safety-critical applications. If a classifier is uncalibrated, then there is a unique way to calibrate its uncertainty, given by the idealistic true calibration map corresponding to this classifier. Although the true calibration map is typically unknown in practice, it can be estimated with many post-hoc calibration methods, which fit a family of candidate calibration maps on a validation dataset. This paper examines the connection between such post-hoc calibration methods and calibration evaluation. Despite the negative connotations of fitting on test data in machine learning, we argue that fitting calibration maps on test data as part of the calibration evaluation process is a method worth considering, and we refer to this view as fit-on-test. This view enables the use of any post-hoc calibration method as an evaluation measure, unlocking missed opportunities in the development of evaluation methods. We prove that even the expected calibration error (ECE), the most common calibration evaluation measure, is actually a fit-on-test measure. This observation leads us to a new method of tuning the number of bins in ECE with cross-validation. Since fitting on test data can lead to test-time overfitting, we also discuss the limitations of and concerns with the fit-on-test view. Our contributions further include: (1) an enhancement of reliability diagrams with diagonal filling; (2) the development of the new calibration map families PL and PL3; and (3) an experimental study of which families perform strongly both as post-hoc calibrators and as calibration evaluators.
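
To make the fit-on-test reading of ECE concrete, the sketch below (not the authors' code; the helper names binned_ece and choose_bins_cv are hypothetical) computes the standard equal-width binned ECE while making the hidden fitting step explicit: the per-bin accuracy is exactly a histogram-binning calibration map fitted on the test predictions themselves. The second helper shows one plausible way, not necessarily the paper's exact procedure, to tune the number of bins by cross-validation within the evaluation data.

```python
# Minimal sketch, assuming the top-label (confidence vs. correctness) setting.
import numpy as np

def binned_ece(confidences, correct, n_bins=15):
    """Equal-width binned ECE; the per-bin accuracy is a histogram-binning
    calibration map fitted on the very data being evaluated (fit-on-test)."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    # Assign each prediction to a bin using the interior edges.
    bin_ids = np.digitize(confidences, edges[1:-1])
    ece = 0.0
    for b in range(n_bins):
        mask = bin_ids == b
        if not mask.any():
            continue
        fitted_value = correct[mask].mean()      # fitted calibration map value for this bin
        avg_conf = confidences[mask].mean()
        ece += mask.mean() * abs(avg_conf - fitted_value)
    return ece

def choose_bins_cv(confidences, correct, candidates=(5, 10, 15, 20, 30),
                   n_folds=5, seed=0):
    """Pick the number of bins by cross-validation on the evaluation data:
    fit the binning map on the training folds and score its squared error
    on the held-out fold (a plausible criterion, not the paper's exact one)."""
    rng = np.random.default_rng(seed)
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    folds = np.array_split(rng.permutation(len(confidences)), n_folds)
    best_bins, best_loss = candidates[0], np.inf
    for n_bins in candidates:
        edges = np.linspace(0.0, 1.0, n_bins + 1)
        loss = 0.0
        for k in range(n_folds):
            test_idx = folds[k]
            train_idx = np.concatenate([folds[j] for j in range(n_folds) if j != k])
            train_bins = np.digitize(confidences[train_idx], edges[1:-1])
            test_bins = np.digitize(confidences[test_idx], edges[1:-1])
            # Fitted calibration map: per-bin accuracy on the training folds,
            # falling back to the overall accuracy for empty bins.
            bin_acc = np.full(n_bins, correct[train_idx].mean())
            for b in range(n_bins):
                mask = train_bins == b
                if mask.any():
                    bin_acc[b] = correct[train_idx][mask].mean()
            pred = bin_acc[test_bins]
            loss += np.mean((pred - correct[test_idx]) ** 2)
        if loss < best_loss:
            best_bins, best_loss = n_bins, loss
    return best_bins

# Example usage on synthetic, slightly overconfident predictions:
# rng = np.random.default_rng(1)
# conf = rng.uniform(0.5, 1.0, size=2000)
# correct = rng.uniform(size=2000) < conf ** 1.5
# print(binned_ece(conf, correct, n_bins=choose_bins_cv(conf, correct)))
```

In this reading, replacing the histogram-binning map with any other post-hoc calibrator (for instance, the PL and PL3 families proposed in the paper) turns that calibrator into a calibration evaluation measure, which is precisely the opportunity the fit-on-test view highlights.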