\label{experimental_setup}
All experiments\footnote{Our code is available at \url{https://github.com/lfpc/FixSelectiveClassification}.} 
in this section were performed using PyTorch \citep{paszke_pytorch_2019} and all of the classifiers it provides pre-trained on ImageNet \citep{deng_imagenet_2009}. Additionally, some models from the repository of \citet{wightman_pytorch_2019} were used, particularly the ones highlighted by \citet{galil_what_2023}. In total, 84 ImageNet classifiers were used. The list of all models, together with the results per model, is presented in Appendix~\ref{appendix:results_imagenet}. The ImageNet validation set was randomly split into 5000 hold-out images for post-hoc optimization (which we also refer to as the \textit{tuning set}) and 45000 images for performance evaluation (the \textit{test set}). To ensure that the results do not depend on a particular split, we repeat each experiment (including post-hoc optimization) for 10 different random splits
and report the mean and standard deviation.

To give evidence that our results are not specific to ImageNet, we also performed experiments on the CIFAR-100 \citep{krizhevsky_learning_2009} and Oxford-IIIT Pet \citep{oxfordpets} datasets, which are presented in Appendix~\ref{appendix:more_datasets}.

\label{section:results}
\subsection{Comparison of Methods}
\label{sec:comparison-methods}


\begin{table*}[h]
\caption{NAURC (mean {\footnotesize $\pm$std}) for post-hoc methods applied to ImageNet classifiers}
\label{tab:results_efficientnet_vgg}
\centering
\begin{tabular}{llcccc}
\toprule
& &\multicolumn{4}{c}{Logit Transformation} \\
\cmidrule(r){3-6}
Classifier & Conf. Estimator &    Raw & TS-NLL &  TS-AURC &  pNorm \\
\midrule \multirow{6}{*}{EfficientNet-V2-XL}
& MSP & 0.4402 {\footnotesize $\pm$0.0032} & 0.3506 {\footnotesize $\pm$0.0039} & 0.1957 {\footnotesize $\pm$0.0027} & 0.1734 {\footnotesize $\pm$0.0030} \\
& SoftmaxMargin & 0.3816 {\footnotesize $\pm$0.0031} & 0.3144 {\footnotesize $\pm$0.0034} & 0.1964 {\footnotesize $\pm$0.0046} & 0.1726 {\footnotesize $\pm$0.0026} \\
& MaxLogit & 0.7680 {\footnotesize $\pm$0.0028} & - & - & \textbf{0.1693} {\footnotesize $\pm$0.0018} \\
& LogitsMargin & 0.1937 {\footnotesize $\pm$0.0023} & - & - & 0.1728 {\footnotesize $\pm$0.0020} \\
& NegativeEntropy & 0.5967 {\footnotesize $\pm$0.0031} & 0.4295 {\footnotesize $\pm$0.0057} & 0.1937 {\footnotesize $\pm$0.0023} & 0.1719 {\footnotesize $\pm$0.0022} \\
& NegativeGini & 0.4486 {\footnotesize $\pm$0.0032} & 0.3517 {\footnotesize $\pm$0.0040} & 0.1957 {\footnotesize $\pm$0.0027} & 0.1732 {\footnotesize $\pm$0.0030} \\
\midrule \multirow{6}{*}{VGG16}
& MSP & \textbf{0.1839} {\footnotesize $\pm$0.0006} & 0.1851 {\footnotesize $\pm$0.0006} & 0.1839 {\footnotesize $\pm$0.0007} & 0.1839 {\footnotesize $\pm$0.0007} \\
& SoftmaxMargin & 0.1900 {\footnotesize $\pm$0.0006} & 0.1892 {\footnotesize $\pm$0.0006} & 0.1888 {\footnotesize $\pm$0.0006} & 0.1888 {\footnotesize $\pm$0.0006} \\
& MaxLogit & 0.3382 {\footnotesize $\pm$0.0009} & - & - & 0.2020 {\footnotesize $\pm$0.0012} \\
& LogitsMargin & 0.2051 {\footnotesize $\pm$0.0005} & - & - & 0.2051 {\footnotesize $\pm$0.0005} \\
& NegativeEntropy & 0.1971 {\footnotesize $\pm$0.0007} & 0.2055 {\footnotesize $\pm$0.0006} & 0.1841 {\footnotesize $\pm$0.0006} & 0.1841 {\footnotesize $\pm$0.0006} \\
& NegativeGini & 0.1857 {\footnotesize $\pm$0.0007} & 0.1889 {\footnotesize $\pm$0.0005} & 0.1840 {\footnotesize $\pm$0.0006} & 0.1840 {\footnotesize $\pm$0.0006} \\
\bottomrule
\end{tabular}
\end{table*}

\begin{table*}[h]
\caption{APG-NAURC (mean {\footnotesize $\pm$std}) of post-hoc methods across 84 ImageNet classifiers}
\label{tab:APG}
\centering
\begin{tabular}{lcccc}
\toprule
&\multicolumn{4}{c}{Logit Transformation} \\
\cmidrule(r){2-5}
Conf. Estimator &    Raw & TS-NLL &  TS-AURC &  pNorm \\
\midrule
MSP & 0.0 {\footnotesize $\pm$ 0.0} & 0.03665 {\footnotesize $\pm$0.00034} & 0.05769 {\footnotesize $\pm$0.00038} & 0.06796 {\footnotesize $\pm$0.00051} \\
SoftmaxMargin & 0.01955 {\footnotesize $\pm$0.00008} & 0.04113 {\footnotesize $\pm$0.00022} & 0.05601 {\footnotesize $\pm$0.00041} & 0.06608 {\footnotesize $\pm$0.00052} \\
MaxLogit & 0.0 {\footnotesize $\pm$ 0.0} & - & - & \textbf{0.06863} {\footnotesize $\pm$0.00045} \\
LogitsMargin & 0.05531 {\footnotesize $\pm$0.00042} & - & - & 0.06204 {\footnotesize $\pm$0.00046} \\
NegativeEntropy & 0.0 {\footnotesize $\pm$ 0.0} & 0.01570 {\footnotesize $\pm$0.00085} & 0.05929 {\footnotesize $\pm$0.00032} & 0.06771 {\footnotesize $\pm$0.00052} \\
NegativeGini & 0.0 {\footnotesize $\pm$ 0.0} & 0.03636 {\footnotesize $\pm$0.00042} & 0.05809 {\footnotesize $\pm$0.00037} & 0.06800 {\footnotesize $\pm$0.00054} \\
\bottomrule
\end{tabular}
\end{table*}


We start by evaluating the NAURC of each possible combination of a confidence estimator listed in Section~\ref{section:confidence-estimation} with a logit transformation described in Section~\ref{sec:logit-transformations}, for two specific models. Table~\ref{tab:results_efficientnet_vgg}
shows the results for EfficientNet-V2-XL (trained on ImageNet-21K and fine-tuned on ImageNet-1K) and VGG16. The former was chosen for having the worst confidence estimation performance (in terms of AUROC, with MSP as the confidence estimator) of all the models reported by \citet{galil_what_2023}, and the latter as a representative example of a lower-accuracy model for which the MSP is already a good confidence estimator.




As can be seen, on EfficientNet-V2-XL, the baseline MSP is easily outperformed by most methods. Surprisingly, the best method is not to use a softmax function but, instead, to take the maximum of a $p$-normalized logit vector, leading to a reduction in NAURC of 0.27 points or about 62\%. 
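Concretely, taking the Raw MSP and MaxLogit-pNorm entries for EfficientNet-V2-XL in Table~\ref{tab:results_efficientnet_vgg} (mean values only, ignoring the spread across splits), the reduction works out as
\[
0.4402 - 0.1693 = 0.2709, \qquad \frac{0.2709}{0.4402} \approx 0.615,
\]
which rounds to the 0.27 points and roughly 62\% quoted above.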

However, on VGG16, the situation is quite different: methods that use the unnormalized logits and improve the performance on EfficientNet-V2-XL, such as LogitsMargin and MaxLogit-pNorm, actually degrade it on VGG16. Moreover, the largest improvement obtained, e.g., with MSP-TS-AURC, is so small that it can be considered negligible. (In fact, gains below 0.003 NAURC are visually imperceptible in an RC curve.) Thus, it is reasonable to assert that none of the post-hoc methods considered is able to outperform the baseline in this case.

In Table~\ref{tab:APG}, we evaluate the average performance of post-hoc methods across all models considered, using the APG-NAURC metric described in Section~\ref{section:fallback}, where we assume $\epsilon=0.01$. Figure~\ref{fig:gains} shows the gains for selected methods for each model, ordered by MaxLogit-pNorm gains. It can be seen that the highest gains are provided by MaxLogit-pNorm, NegativeGini-pNorm, MSP-pNorm and NegativeEntropy-pNorm, and their performance is essentially indistinguishable whenever they provide a non-negligible gain over the baseline. Moreover, the set of models for which significant gains can be obtained appears to be consistent across all methods.
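As a reminder of how the entries of Table~\ref{tab:APG} are aggregated, the following is a sketch that assumes the thresholded-gain form of the metric from Section~\ref{section:fallback}; the symbols $\mathcal{H}$ (the set of classifiers), $h$ (a classifier) and $g$ (a confidence estimator) are notation assumed here for illustration:
\[
\text{APG}(g) = \frac{1}{|\mathcal{H}|} \sum_{h \in \mathcal{H}} \left[\text{NAURC}(h, \text{MSP}) - \text{NAURC}(h, g)\right]_{\epsilon}^{+},
\qquad
[x]_{\epsilon}^{+} =
\begin{cases}
x, & x > \epsilon, \\
0, & \text{otherwise}.
\end{cases}
\]
Under this reading, a method that never beats the MSP baseline by more than $\epsilon$ on any model scores exactly zero, which is consistent with the $0.0$ entries in the Raw column.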



\begin{figure}[h]
\centering
\begin{subfigure}{\linewidth}
    \centering
    \includegraphics[width=\linewidth]{figs/gains_methods_ImageNet.pdf}\\[-1ex]
    \caption{All classifiers}
\end{subfigure}
\begin{subfigure}{\linewidth}
    \centering
    \vspace{1ex}
    \includegraphics[width=\linewidth]{figs/NAURC_gains_methods_ImageNet_zoom.pdf}\\[-1ex]
    \caption{Close-up}
\end{subfigure}
    \caption{NAURC gains for post-hoc methods across 84 ImageNet classifiers. Lines indicate the average of 10 random splits and the filled regions indicate $\pm 1$ standard deviation. The black dashed line denotes $\epsilon = 0.01$.}
    \label{fig:gains}
\end{figure}

Although several post-hoc methods provide considerable gains, they all share a practical limitation: they require hold-out data for hyperparameter tuning. In Appendix~\ref{appendix:data_efficiency}, we study the data efficiency of some of the best-performing methods. MaxLogit-pNorm, having a single hyperparameter, emerges as a clear winner, requiring fewer than 500 samples to achieve near-optimal performance on ImageNet ($< 0.5$ images per class on average) and fewer than 100 samples on CIFAR-100 ($< 1$ image per class on average). These requirements are easily satisfied in practice for typical validation set sizes.

Details on the optimization of $T$ and $p$, additional results showing AUROC values and RC curves, and results on the insensitivity of our conclusions to the choice of $\epsilon$ are provided in Appendix~\ref{appendix:more_results}. In addition, the benefits of a tunable versus fixed $p$ and a comparison with other tunable methods that do not fit into the framework of Section~\ref{sec:logit-transformations} are discussed, respectively, in Appendices \ref{appendix:ablation} and \ref{appendix:other-methods}. Finally, an investigation of the calibration performance of some methods can be found in Appendix~\ref{appendix:investigation}.


\subsection{Post-hoc Optimization Fixes Broken Confidence Estimators}

\begin{figure}[h]%[!htb]
\centering
\begin{subfigure}[b]{\linewidth}
    \centering
    \includegraphics[width=0.98\linewidth]{figs/NAURC.pdf}\\[-1ex]
    \caption{NAURC}
    \label{fig:naurc-models}
\end{subfigure}
\begin{subfigure}[b]{\linewidth}
    \centering
    \vspace{1ex}
    \includegraphics[width=0.98\linewidth]{figs/AURC.pdf}\\[-1ex]
    \caption{AURC}
    \label{fig:aurc-models}
\end{subfigure}
\begin{subfigure}[b]{\linewidth}
    \centering
    \vspace{1ex}
    \includegraphics[width=0.98\linewidth]{figs/SAC.pdf}\\[-1ex]
    \caption{Coverage at 98\% selective accuracy}
    \label{fig:sac-models}
\end{subfigure}
\caption{NAURC, AURC and SAC of 84 ImageNet classifiers with respect to their accuracy, before and after post-hoc optimization. The baseline plots use MSP, while the optimized plots use MaxLogit-pNorm. The legend shows the optimal value of $p$ for each model, where MSP indicates MSP fallback (no significant positive gain). $\rho$ is the Spearman correlation between a metric and the accuracy. In (c), models that cannot achieve the desired selective accuracy are shown with $\approx 0$ coverage.}
\label{fig:aurc-naurc-models}
\end{figure}

From Figure~\ref{fig:gains}, we can distinguish two groups of models: those for which the MSP baseline is already the best confidence estimator and those for which post-hoc methods provide considerable gains (particularly, MaxLogit-pNorm). In fact, most models belong to the second group, which comprises 58 of the 84 models considered.

Figure~\ref{fig:aurc-naurc-models} illustrates two noteworthy phenomena. First, as previously observed by \citet{galil_what_2023}, certain models exhibit higher accuracy than others but poorer uncertainty estimation, leading to a trade-off when selecting a classifier for selective classification. Second, post-hoc optimization can fix any ``broken'' confidence estimator. This can be seen in two ways. In Figure~\ref{fig:naurc-models}, after optimization, all models exhibit a much more similar level of confidence estimation performance (as measured by NAURC), although a dependency on accuracy is clearly seen (better predictive models are better at predicting their own failures). In Figure~\ref{fig:aurc-models}, it is clear that, after optimization, the selective classification performance of any classifier (as measured by AURC) becomes almost entirely determined by its accuracy. Indeed, the Spearman correlation between AURC and accuracy becomes extremely close to 1. The same conclusions hold for the SAC metric, as shown in Figure~\ref{fig:sac-models}.
This implies that any ``broken'' confidence estimators have been fixed, and consequently, total accuracy becomes the primary determinant of selective performance even at lower coverage levels.



%An intriguing question is what properties of a classifier make it bad at confidence estimation. Experiments investigating this question are presented in Appendix \ref{appendix:investigation}. In summary, our surprising conclusion is that models that produce highly confident MSPs tend to have better confidence estimators (in terms of NAURC), while models whose MSP distribution is more balanced tend to be easily improvable by post-hoc optimization---which, in turn, makes the resulting confidence estimator concentrated on highly confident values.

\subsection{Performance Under Distribution Shift}
\label{section:distshift}

We now turn to the question of how post-hoc methods for selective classification perform under distribution shift. Previous works have shown that calibration can be harmed under distribution shift, especially when certain post-hoc methods, such as TS, are applied \citep{ovadia_can_2019}. To find out whether a similar issue occurs for selective classification, we evaluate selected post-hoc methods on ImageNet-C \citep{hendrycks_benchmarking_2019}, which consists of 15 different corruptions of the ImageNet validation set, and on ImageNetV2 \citep{recht2019imagenetv2}, which is an independently sampled test set for ImageNet that replicates the original dataset creation process.
We follow the standard approach for evaluating robustness with these datasets, which is to use them only for inference; thus, the post-hoc methods are optimized using only the 5000 hold-out images from the uncorrupted ImageNet validation dataset. To avoid data leakage, the same split is applied to the ImageNet-C dataset, so that inference is performed only on the 45000 images originally selected as the test set.

\begin{figure*}[h]
\centering
\begin{subfigure}[t]{0.25\textwidth}
    \centering
    \includegraphics[width=\textwidth]{figs/gains_v2.pdf}
    \caption{}
    \label{fig:gains_v2}
\end{subfigure}
\begin{subfigure}[t]{0.25\textwidth}
    \centering
    \includegraphics[width=\textwidth]{figs/NAURC_consistency.pdf}
    \caption{}
    \label{fig:naurc_consistency}
\end{subfigure}
\begin{subfigure}[t]{0.48\textwidth}
    \centering
    \includegraphics[width=\textwidth]{figs/NAURC_shift.pdf}
    \caption{}
    \label{fig:datashift_acc_naurc_baseline_full}
\end{subfigure}

\caption{(a) NAURC gains (over MSP) on ImageNetV2 versus NAURC gains on the ImageNet test set. (b) NAURC on ImageNetV2 versus NAURC on the ImageNet test set. (c) NAURC versus accuracy for ImageNetV2, ImageNet-C and the IID dataset. All models are optimized using MaxLogit-pNorm (with MSP fallback).}
\end{figure*}

\begin{table*}[h]
\caption{Selective classification performance (achievable coverage for some target selective accuracy; mean {\tiny $\pm$std}) for a ResNet-50 on ImageNet under distribution shift. For ImageNet-C, each entry is the average across all corruption types for a given level of corruption. The target accuracy is the one achieved for corruption level 0.}
\label{tab:results_datashift}
\centering
\begin{tabular}{ccccccccc}
\toprule
& & \multicolumn{6}{c}{Corruption level} & \\
\cmidrule(r){3-8}
  & Method & 0 & 1 & 2 & 3 & 4 & 5 & V2 \\
 \midrule
 Accuracy [\%] & - & 80.84 & 67.81 \tiny $\pm$0.05 & 58.90 \tiny $\pm$0.04 & 49.77 \tiny $\pm$0.04 & 37.92 \tiny $\pm$0.03 & 26.51 \tiny $\pm$0.03 & 69.77 \tiny $\pm$0.10
 \\ 
 \midrule
 \multirow{3}{4.25em}{Coverage (SAC) [\%]} 
&MSP & 100 & 72.14 \tiny $\pm$0.11 & 52.31 \tiny $\pm$0.13 & 37.44 \tiny $\pm$0.11 & 19.27 \tiny $\pm$0.07 & 8.53 \tiny $\pm$0.12 & 76.24 \tiny $\pm$0.22 \\
&MSP-TS-AURC & 100 & 72.98 \tiny $\pm$0.23 & 55.87 \tiny $\pm$0.27 & 40.89 \tiny $\pm$0.21 & 24.65 \tiny $\pm$0.19 & 12.52 \tiny $\pm$0.05 & 76.22 \tiny $\pm$0.41 \\
&MaxLogit-pNorm & 100 & \textbf{75.24} \tiny $\pm$0.15 & \textbf{58.58} \tiny $\pm$0.27 & \textbf{43.67} \tiny $\pm$0.37 & \textbf{27.03} \tiny $\pm$0.36 & \textbf{14.51} \tiny $\pm$0.26 & \textbf{78.66} \tiny $\pm$0.38 \\
\bottomrule
\end{tabular}
\end{table*}


First, we evaluate the performance of MaxLogit-pNorm on ImageNet and ImageNetV2 for all classifiers considered. Figure~\ref{fig:gains_v2} shows that the NAURC gains (over the MSP baseline) obtained for ImageNet translate to similar gains for ImageNetV2, indicating that this post-hoc method is quite robust to distribution shift. Then, considering all models after post-hoc optimization with MaxLogit-pNorm, we investigate whether selective classification performance itself (as measured by NAURC) is robust to distribution shift. As can be seen in Figure~\ref{fig:naurc_consistency}, the NAURC values on the two datasets are consistent, closely following an affine relationship (with a Pearson correlation of 0.983); however, a significant degradation in NAURC can be observed for all models under distribution shift. While at first sight this would suggest a lack of robustness, a closer look reveals that it can actually be explained by the natural accuracy drop of the underlying classifier under distribution shift. Indeed, we have already noted in Figure~\ref{fig:naurc-models} a negative correlation between NAURC and accuracy; in Figure~\ref{fig:datashift_acc_naurc_baseline_full}, these results are extended by including the evaluation on ImageNetV2 and also (for the selected models AlexNet, ResNet50, WideResNet50-2, VGG11, EfficientNet-B3 and ConvNext-Large, sorted by accuracy) on \mbox{ImageNet-C}, where we can see that the strong correlation between NAURC and accuracy continues to hold.



Finally, to give a more tangible illustration of the impact of selective classification, Table~\ref{tab:results_datashift} shows the SAC metric for a ResNet50 under distribution shift, with the target accuracy set to the original accuracy obtained on the in-distribution test data. As can be seen, the original accuracy can be restored at the expense of coverage; moreover, MaxLogit-pNorm achieves the highest coverage for all distribution shifts considered, improving coverage over the MSP baseline by 2.4 to 7.8 percentage points.

