Evaluating Posterior Probabilities: Decision Theory, Proper Scoring Rules, and Calibration

TMLR Paper3224 Authors

21 Aug 2024 (modified: 31 Oct 2024) · Under review for TMLR · CC BY 4.0
Abstract: Most machine learning classifiers are designed to output posterior probabilities for the classes given the input sample. These probabilities may be used to make a categorical decision about the class of the sample, provided as input to a downstream system, or provided to a human for interpretation. Evaluating the quality of the posteriors generated by these systems is an essential problem that was addressed decades ago with the invention of proper scoring rules (PSRs). Unfortunately, much of the recent machine learning literature uses calibration metrics---most commonly, the expected calibration error (ECE)---as a proxy to assess posterior performance. The problem with this approach is that calibration metrics reflect only one aspect of the quality of the posteriors, ignoring discrimination performance. For this reason, we argue that calibration metrics should play no role in the assessment of posterior quality. Expected PSRs should instead be used for this task, preferably normalized for ease of interpretation. In this work, we first give a brief review of PSRs from a practical perspective, motivating their definition using Bayes decision theory. We discuss why expected PSRs provide a principled measure of the quality of a system's posteriors and why calibration metrics are not the right tool for this job. We argue that calibration metrics, while not useful for performance assessment, may be used as diagnostic tools during system development. With this purpose in mind, we discuss a simple and practical calibration metric, called calibration loss, derived from a decomposition of expected PSRs. We compare this metric with the ECE and with the expected score divergence calibration metric from the PSR literature and argue, using theoretical and empirical evidence, that calibration loss is superior to these two metrics.
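For readers who want a concrete sense of the quantities named in the abstract, the sketch below (Python, not the authors' code) illustrates a normalized expected PSR (cross-entropy divided by the loss of a priors-only system), a standard binned ECE, and an illustrative calibration loss computed as the PSR gap between the raw posteriors and a recalibrated version of them. The temperature-scaling recalibration and all function names here are assumptions for illustration only; the paper's own calibration-loss procedure is defined via its PSR decomposition and may differ.

```python
# Hypothetical sketch, not the authors' implementation.
import numpy as np
from scipy.optimize import minimize
from scipy.special import softmax

def expected_log_loss(posteriors, labels):
    """Mean cross-entropy (the logarithmic PSR) of posteriors w.r.t. the labels."""
    eps = 1e-12
    return -np.mean(np.log(posteriors[np.arange(len(labels)), labels] + eps))

def normalized_log_loss(posteriors, labels):
    """Log loss divided by that of a naive system that always outputs the class
    priors, so a value of 1.0 means 'no better than the priors'."""
    priors = np.bincount(labels, minlength=posteriors.shape[1]) / len(labels)
    naive = np.tile(priors, (len(labels), 1))
    return expected_log_loss(posteriors, labels) / expected_log_loss(naive, labels)

def ece(posteriors, labels, n_bins=15):
    """Standard binned ECE on the top-class confidence."""
    conf = posteriors.max(axis=1)
    correct = (posteriors.argmax(axis=1) == labels).astype(float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    total = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (conf > lo) & (conf <= hi)
        if mask.any():
            total += mask.mean() * abs(conf[mask].mean() - correct[mask].mean())
    return total

def calibration_loss(posteriors, labels):
    """Illustrative calibration loss: expected log loss of the raw posteriors minus
    the expected log loss after a temperature-scaling recalibration fitted on the
    same labels (a stand-in for the recalibration step in the PSR decomposition)."""
    logp = np.log(posteriors + 1e-12)

    def loss(t):
        recal = softmax(logp / t[0], axis=1)
        return expected_log_loss(recal, labels)

    t_opt = minimize(loss, x0=[1.0], bounds=[(0.05, 20.0)]).x
    return expected_log_loss(posteriors, labels) - loss(t_opt)

# Example on synthetic 3-class posteriors.
rng = np.random.default_rng(0)
labels = rng.integers(0, 3, size=1000)
logits = rng.normal(size=(1000, 3)) + 2.0 * np.eye(3)[labels]
posteriors = softmax(logits, axis=1)
print(normalized_log_loss(posteriors, labels),
      ece(posteriors, labels),
      calibration_loss(posteriors, labels))
```

In this toy setting, the normalized log loss summarizes overall posterior quality, while the ECE and the calibration-loss gap isolate only the calibration component; a system can have a small ECE yet a poor normalized PSR, which is the abstract's central point.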
Submission Length: Long submission (more than 12 pages of main content)
Changes Since Last Submission: This revision addresses the comments and suggestions from the three reviewers as follows:
* Added a comment in the introduction to make explicit our fundamental assumption that decisions should be made rationally and that rational decisions are made with Bayes decision theory.
* Added further citations in the introduction to works that use calibration metrics to evaluate the quality of posteriors.
* Added a paragraph in Section 2.4 describing the analogy between calibration quality and training set size.
* Added three figures in the theoretical section to illustrate the different posteriors in the equations, the procedure needed to compute the calibration loss, and the three PSRs we discuss.
* Added results on three medical imaging datasets from the MedMNIST corpus. Discarded the CIFAR100 results since they were not essential to the discussion and the table had become too large.
* Introduced colors in Table 2 to facilitate visualization of the main points highlighted in the text.
* Increased the size of the figures in the experimental section.
* Moved the dataset descriptions from the appendix into Section 3.5 of the main body since they were already quite brief.

The changes are highlighted in red in the document. We look forward to any further comments or suggestions from the reviewers.
Assigned Action Editor: ~Takashi_Ishida1
Submission Number: 3224