Scalable Utility-Aware Multiclass Calibration

10 May 2025 (modified: 29 Oct 2025) · Submitted to NeurIPS 2025 · CC BY 4.0
Keywords: Calibration, Uncertainty Quantification
TL;DR: In this paper, we develop metrics and methodology for scalable utility-aware assessment of multiclass calibration
Abstract: Ensuring that classifiers are well-calibrated, i.e., that their predictions align with observed frequencies, is a minimal and fundamental requirement for classifiers to be viewed as trustworthy. Existing methods for assessing multiclass calibration often focus on specific aspects of prediction (e.g., top-class confidence, class-wise calibration) or rely on computationally challenging variational formulations. We instead propose \textit{utility calibration}, a general framework for evaluating model calibration directly through the lens of downstream applications. This approach measures calibration error relative to a specific \textit{utility function} that encapsulates the goals or decision criteria relevant to the end user. As such, utility calibration provides a task-specific perspective on reliability. We demonstrate how this framework can \textit{unify and re-interpret several existing calibration metrics}, in particular yielding more robust versions of the top-class and class-wise calibration metrics, and how it goes beyond such binarized approaches toward assessing calibration for richer classes of downstream utilities.
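To make the abstract's idea concrete, below is a minimal illustrative sketch of how an ECE-style estimate of utility calibration could be computed. The utility matrix, the argmax action rule, the equal-width binning, and the function name `utility_calibration_error` are assumptions chosen for illustration, not the paper's actual definitions.

```python
# Illustrative sketch only: a binned, ECE-style estimator of "utility calibration"
# under one plausible reading of the abstract. The utility function, the action
# rule, and the binning scheme are assumptions, not the paper's definitions.
import numpy as np

def utility_calibration_error(probs, labels, utility, n_bins=15):
    """probs: (n, K) predicted class probabilities; labels: (n,) true classes;
    utility: (A, K) matrix, utility[a, k] = payoff of action a when the class is k."""
    # Expected utility of each action under the predicted distribution.
    expected = probs @ utility.T                              # (n, A)
    actions = expected.argmax(axis=1)                         # decision rule: maximize expected utility
    predicted_u = expected[np.arange(len(probs)), actions]    # utility the model "claims"
    realized_u = utility[actions, labels]                     # utility actually obtained

    # Bin by predicted utility and compare bin-wise means (ECE-style).
    edges = np.linspace(predicted_u.min(), predicted_u.max(), n_bins + 1)
    idx = np.clip(np.digitize(predicted_u, edges) - 1, 0, n_bins - 1)
    err = 0.0
    for b in range(n_bins):
        mask = idx == b
        if mask.any():
            err += mask.mean() * abs(predicted_u[mask].mean() - realized_u[mask].mean())
    return err
```

Note that with the 0/1 utility `utility[a, k] = 1[a == k]`, the predicted utility is the top-class confidence and the realized utility is the correctness indicator, so this sketch reduces to the familiar top-class expected calibration error, consistent with the unification the abstract describes.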
Supplementary Material: zip
Primary Area: General machine learning (supervised, unsupervised, online, active, etc.)
Submission Number: 17805