On Temperature Scaling and Conformal Prediction of Deep Classifiers

Published: 01 May 2025 · Last Modified: 18 Jun 2025 · ICML 2025 poster · CC BY 4.0
TL;DR: We theoretically and empirically analyze the impact of temperature scaling beyond its usual calibration role on key conformal prediction methods.
Abstract: In many classification applications, the prediction of a deep neural network (DNN) based classifier needs to be accompanied by some confidence indication. Two popular approaches to this end are: 1) *Calibration*: modifies the classifier's softmax values so that the maximal value better estimates the correctness probability; and 2) *Conformal Prediction* (CP): produces a prediction set of candidate labels that contains the true label with a user-specified probability, guaranteeing marginal coverage but not, e.g., per-class coverage. In practice, both types of indication are desirable, yet so far the interplay between them has not been investigated. Focusing on the ubiquitous *Temperature Scaling* (TS) calibration, we begin this paper with an extensive empirical study of its effect on prominent CP methods. We show that while TS calibration improves the class-conditional coverage of adaptive CP methods, surprisingly, it negatively affects their prediction set sizes. Motivated by this behavior, we explore the effect of TS on CP *beyond its calibration application* and reveal an intriguing trend under which it allows trading prediction set size for conditional coverage in adaptive CP methods. We then establish a mathematical theory that explains the entire non-monotonic trend. Finally, based on our experiments and theory, we offer guidelines for practitioners to effectively combine adaptive CP with calibration, aligned with user-defined goals.
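As a rough illustration of the pipeline studied here, the sketch below applies temperature scaling to logits before building Adaptive Prediction Sets (APS) via split conformal prediction. It is a minimal stand-in, not the released TS4CP code: the APS randomization term is omitted, and the array names, the temperature value `T`, and the miscoverage level `alpha` are placeholders.

```python
# Minimal sketch (not the authors' TS4CP implementation): temperature scaling
# of logits followed by split-conformal APS prediction sets (no randomization).
import numpy as np

def softmax(logits, T=1.0):
    z = logits / T
    z = z - z.max(axis=1, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def aps_scores(probs, labels):
    """APS conformity score: cumulative probability mass of classes ranked
    at least as high as the true label."""
    order = np.argsort(-probs, axis=1)            # classes sorted descending
    ranks = np.argsort(order, axis=1)             # rank position of each class
    cumsum = np.cumsum(np.take_along_axis(probs, order, axis=1), axis=1)
    idx = np.arange(len(labels))
    return cumsum[idx, ranks[idx, labels]]

def aps_prediction_sets(cal_logits, cal_labels, test_logits, T=1.0, alpha=0.1):
    # 1) Temperature-scale both calibration and test logits.
    cal_probs, test_probs = softmax(cal_logits, T), softmax(test_logits, T)
    # 2) Conformal quantile of calibration scores at level ceil((n+1)(1-alpha))/n.
    n = len(cal_labels)
    scores = aps_scores(cal_probs, cal_labels)
    level = min(np.ceil((n + 1) * (1 - alpha)) / n, 1.0)
    qhat = np.quantile(scores, level, method="higher")
    # 3) Include classes until the cumulative sorted probability reaches qhat.
    order = np.argsort(-test_probs, axis=1)
    cumsum = np.cumsum(np.take_along_axis(test_probs, order, axis=1), axis=1)
    set_sizes = (cumsum < qhat).sum(axis=1) + 1   # always at least one class
    return [set(order[i, :set_sizes[i]]) for i in range(len(test_probs))]
```

Varying `T` in such a pipeline is the knob the paper analyzes: larger temperatures flatten the softmax, which tends to change both the prediction set sizes and the class-conditional coverage of adaptive CP methods.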
Lay Summary: Modern machine learning classifiers often output a "confidence" score alongside their predictions. While this is meant to indicate how likely the prediction is to be correct, these scores often do not reflect true likelihoods. This miscalibration can be dangerous in high-stakes applications, where decisions based on false confidence can have serious consequences. To address this, two key techniques have emerged: *calibration*, which adjusts confidence scores to better align with actual correctness probabilities, and *conformal prediction* (CP), which produces a set of possible classes guaranteed to contain the true label with a specified probability. However, the interplay between these methods remains largely unexplored. This paper addresses that gap: What happens when you apply calibration (specifically, temperature scaling) before using conformal prediction? Surprisingly, while temperature scaling improves coverage performance across different categories (e.g., image classes), it can also make the list of possible answers longer and less useful. We further experimented with different temperature values and found that this trade-off still holds. In addition, we developed a mathematical theory that explains the effect of temperature scaling on the prediction set sizes produced by CP methods. Finally, we distilled these insights into actionable guidelines to help practitioners—particularly in high-stakes domains—better tune CP methods for more reliable and informative AI predictions.
Link To Code: https://github.com/lahavdabah/TS4CP
Primary Area: Deep Learning->Everything Else
Keywords: classification, temperature scaling, conformal prediction, conditional coverage, prediction sets
Submission Number: 7664