\begin{figure*}[ht]
\centering
\includegraphics[width=\linewidth]{figs/RCs-models.pdf}
\caption{A comparison of RC curves for three models selected in \citep{galil_what_2023}, including examples of the highest (ViT-L/16-384) and lowest (EfficientNet-V2-XL) AUROC. An RC curve shows the tradeoff between risk (in this case, error rate) and coverage. The initial risk for any classifier is found at the 100\% coverage point, where all predictions are accepted. Normally, the risk can be reduced by reducing coverage (which is done by increasing the selection threshold); for instance, a 2\% error rate can be obtained at 36.2\% coverage for the ViT-B/32-224-SAM model and at 61.9\% coverage for the ViT-L/16-384 model. However, for the EfficientNet-V2-XL model, this error rate is not achievable at any coverage, since its RC curve is bounded below by 5\% risk. Moreover, this RC curve is actually non-monotonic: at low coverage, the risk increases as coverage is reduced further. Fortunately, this apparent pathology in EfficientNet-V2-XL completely disappears after a simple post-hoc tuning of its confidence estimator (without the need to retrain the model), resulting in significantly improved selective classification performance. In particular, a 2\% error rate can then be achieved at 55.3\% coverage.}
\label{fig:RC-Comparison}
\end{figure*}

Consider a machine learning classifier that does not reach the desired performance for the intended application, even after significant development time. This may occur for a variety of reasons: the problem is too hard for current technology; more development resources (data, compute or time) are needed than is economically feasible for the specific situation; or the target distribution differs from the training distribution, resulting in a performance gap.
In this case, one is faced with the choice of deploying an underperforming model or not deploying a model at all.

A better tradeoff may be achieved by using so-called selective classification \citep{geifman_selective_2017,El-Yaniv.Wiener.2010.Foundations-Noisefree-Selective}. The idea is to run the model on all inputs but reject the predictions for which the model is least confident, with the aim of increasing performance on the accepted predictions. The rejected inputs may be processed in the same way as if the model were not deployed, for instance, by a human specialist or by the previously existing system. This offers a tradeoff between performance and \textit{coverage} (the proportion of accepted predictions), which may be a better solution than either extreme. In particular, it could shorten the path to adoption of deep learning in safety-critical applications, such as medical diagnosis and autonomous driving, where the consequences of erroneous decisions can be severe \citep{zou_review_2023,neumann2018relaxed}.
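
To make this tradeoff concrete, one common formalization is sketched below; the notation is introduced here only for illustration and may differ from that used in the rest of the paper. Given $N$ labeled samples $(x_i, y_i)$ with predictions $\hat{y}_i$, a confidence estimator $g$ and a threshold $t$, a prediction is accepted when $g(x_i) \geq t$, so that the empirical coverage and the empirical selective risk (here, the error rate on accepted predictions) are
\[
\phi(t) = \frac{1}{N} \sum_{i=1}^{N} [g(x_i) \geq t],
\qquad
R(t) = \frac{\sum_{i=1}^{N} [\hat{y}_i \neq y_i]\,[g(x_i) \geq t]}{\sum_{i=1}^{N} [g(x_i) \geq t]},
\]
where $[\cdot]$ equals $1$ if its argument is true and $0$ otherwise, and $R(t)$ is defined whenever at least one prediction is accepted. Increasing $t$ can only reduce the coverage $\phi(t)$; whether it also reduces $R(t)$ depends on how well $g$ ranks correct predictions above incorrect ones.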

A key element in selective classification is the confidence estimator that is thresholded to decide whether a prediction is accepted. In the case of neural networks with softmax outputs, the natural baseline to be used as a confidence estimator is the maximum softmax probability (MSP) produced by the model, also known as the softmax response \citep{geifman_selective_2017,hendrycks_baseline_2018}. Several approaches have been proposed %\footnote{For a more complete account of related work, please see Appendix~\ref{sec:related-work}.} 
attempting to improve upon this baseline, which generally fall into two categories: approaches that require retraining the classifier, by modifying some aspect of the architecture or the training procedure, possibly adding an auxiliary head as the confidence estimator \citep{geifman_selectivenet_2019,liu_deep_2019,huang_self-adaptive_2020}; and post-hoc approaches that do not require retraining, thus only modifying or replacing the confidence estimator based on outputs or intermediate features produced by the model \citep{corbiere_confidence_2021,granese2021doctor,shen_post-hoc_2022,galil_what_2023}. The latter is arguably the most practical scenario, especially if tuning the confidence estimator is sufficiently simple.
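
For reference, the MSP baseline mentioned above can be written as follows, under notation that we assume here only for illustration: if $z = (z_1, \ldots, z_C)$ denotes the logits produced by the model for an input over $C$ classes, then
\[
\mathrm{MSP}(z) = \max_{k} \frac{e^{z_k}}{\sum_{j=1}^{C} e^{z_j}},
\]
that is, the largest entry of the softmax output, which is the probability assigned to the predicted class.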

In this paper, we focus on the simplest possible class of post-hoc methods, namely those for which the confidence estimator can be computed directly from the network's unnormalized \textit{logits} (pre-softmax outputs). Our main goal is to identify the methods that produce the largest gains in selective classification performance, measured by the area under the risk-coverage curve (AURC); however, since these methods generally have hyperparameters that must be tuned on hold-out data, we are also concerned with data efficiency. Our study is motivated by an intriguing problem reported in \citep{galil_what_2023} and illustrated in Fig.~\ref{fig:RC-Comparison}: some state-of-the-art ImageNet classifiers, despite attaining excellent predictive performance, nevertheless exhibit appallingly poor performance at detecting their own mistakes. Can such pathologies be fixed by simple post-hoc methods?

To answer this question, we consider every such method known to us, as well as several variations and novel methods that we propose, and perform an extensive experimental study using 84 pretrained ImageNet classifiers available from popular repositories. Our results show that, among other close contenders, a simple $p$-norm normalization of the logits, followed by taking the maximum logit as the confidence estimator, can lead to considerable gains in selective classification performance, completely fixing the pathological behavior observed in many classifiers, as illustrated in Fig.~\ref{fig:RC-Comparison}. As a consequence, the selective classification performance of any classifier becomes almost entirely determined by its corresponding accuracy.
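
As a rough sketch of the $p$-norm normalization mentioned above, and under notation assumed here only for illustration, let $z = (z_1, \ldots, z_C)$ denote the logits for an input over $C$ classes. A confidence estimator of this kind could take the form
\[
g_p(z) = \max_{k} \frac{z_k}{\|z\|_p},
\qquad
\|z\|_p = \Bigl( \sum_{j=1}^{C} |z_j|^p \Bigr)^{1/p},
\]
where $p$ would be treated as a hyperparameter tuned on hold-out data. This expression is meant only to convey the idea; details such as any preprocessing applied to the logits before normalization are left to the description of the method itself.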

The main contributions of this work are summarized as follows:
\begin{itemize}
%\item We propose a simple but powerful framework for designing confidence estimators, which involves tunable logit transformations optimized directly for a selective classification metric;
\item We perform an extensive experimental study of many existing and proposed confidence estimators, obtaining considerable gains for most classifiers. In particular, we find that a simple post-hoc estimator can provide up to a 62\% reduction in normalized AURC using no more than one sample per class of labeled hold-out data;
\item We show that, after post-hoc optimization, the selective classification performance of any classifier becomes almost entirely determined by its corresponding accuracy, eliminating the apparent tradeoff between these two goals reported in previous work;
\item We also study how these post-hoc methods perform under distribution shift and find that the results remain consistent: a method that provides gains in the in-distribution scenario also provides considerable gains under distribution shift.
%\item We investigate why certain classifiers innately have a good confidence estimator that apparently cannot be improved by post-hoc methods.
\end{itemize}


