\section{On the \textsc{Doctor} method}
\label{sec:doctor}

The paper by \citet{granese2021doctor} introduces a selection mechanism named \textsc{Doctor}, which actually refers to two distinct methods, $D_\alpha$ and $D_\beta$, in two possible scenarios, Total Black Box and Partial Black Box. Only the former scenario corresponds to post-hoc estimators and, in this case, the two methods are equivalent to NegativeGini and MSP, respectively.

To see this, first consider the definition of $D_\alpha$: a sample $x$ is rejected if $1 - \hat{g}(x) > \gamma \hat{g}(x)$, where
\[
1 - \hat{g}(x) = \sum_{k \in \calY} (\sigma(\bz))_k (1 - (\sigma(\bz))_k) = 1 - \sum_{k \in \calY} (\sigma(\bz))_k^2 = 1 - \|\sigma(\bz)\|_2^2
\]
is exactly the Gini index of diversity applied to the softmax outputs. Thus, a sample $x$ is accepted if $1 - \hat{g}(x) \leq \gamma \hat{g}(x) \iff (1 + \gamma) \hat{g}(x) \geq 1 \iff \hat{g}(x) \geq 1/(1 + \gamma) \iff \hat{g}(x) - 1 \geq 1/(1 + \gamma) - 1$. Therefore, the method is equivalent to the confidence estimator $g(x) = \hat{g}(x) - 1 = \|\sigma(\bz)\|_2^2 - 1$, with $t = 1/(1+\gamma) - 1$ as the selection threshold.
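
As a quick numerical check of this equivalence (the value of $\gamma$ and the softmax outputs below are hypothetical, chosen only for illustration), take $\gamma = 1$, so that $t = 1/(1+\gamma) - 1 = -0.5$, and consider two samples with softmax outputs $(0.7, 0.2, 0.1)$ and $(0.5, 0.3, 0.2)$:
\begin{align*}
\hat{g} &= 0.7^2 + 0.2^2 + 0.1^2 = 0.54, & 1 - \hat{g} = 0.46 &\leq \gamma \hat{g} = 0.54, & g = -0.46 &\geq t, \\
\hat{g} &= 0.5^2 + 0.3^2 + 0.2^2 = 0.38, & 1 - \hat{g} = 0.62 &> \gamma \hat{g} = 0.38, & g = -0.62 &< t.
\end{align*}
Both formulations accept the first sample and reject the second.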

Now, consider the definition of $D_\beta$: a sample $x$ is rejected if $\hat{P_e}(x) > \gamma (1 - \hat{P_e}(x))$, where $\hat{P_e}(x) = 1 - (\sigma(\bz))_{\hat{y}}$ and $\hat{y} = \argmax_{k \in \calY} (\sigma(\bz))_k$, i.e., $\hat{P_e}(x) = 1 - \text{MSP}(\bz)$. Thus, a sample $x$ is accepted if $\hat{P_e}(x) \leq \gamma (1 - \hat{P_e}(x)) \iff (1 + \gamma) \hat{P_e}(x) \leq \gamma \iff \hat{P_e}(x) \leq \gamma/(1 + \gamma) \iff \text{MSP}(\bz) \geq 1 - \gamma/(1 + \gamma) = 1/(1+\gamma)$. Therefore, the method is equivalent to the confidence estimator $g(x) = \text{MSP}(\bz)$, with $t = 1/(1+\gamma)$ as the selection threshold.
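
The same kind of check can be made here, again with hypothetical values. Take $\gamma = 0.25$, so that $t = 1/(1+\gamma) = 0.8$, and consider two samples with $\text{MSP}(\bz) = 0.7$ and $\text{MSP}(\bz) = 0.9$:
\begin{align*}
\hat{P_e} &= 0.3 > \gamma (1 - \hat{P_e}) = 0.175, & \text{MSP}(\bz) = 0.7 &< t, \\
\hat{P_e} &= 0.1 \leq \gamma (1 - \hat{P_e}) = 0.225, & \text{MSP}(\bz) = 0.9 &\geq t.
\end{align*}
Both formulations reject the first sample and accept the second.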

Given the above results, one may wonder why \citet{granese2021doctor} report different performance values for $D_\beta$ and MSP (softmax response), for instance in their Table~1. We suspect this discrepancy is due to numerical imprecision in the computation of the ROC curve for a limited number of threshold values, as the authors themselves point out in their Appendix C.3, combined with the fact that $D_\beta$ and MSP in \citet{granese2021doctor} use different parametrizations for the threshold values. In contrast, we use the implementation from the scikit-learn library (adapting it as necessary for the RC curve), which considers every possible threshold for the given confidence values and so is immune to this kind of imprecision.


\section{On Logit Normalization}
\label{appendix:logit-norm}

\textbf{Logit normalization during training.} \citet{wei_mitigating_2022} argued that, as training progresses, a model may tend to become overconfident on correctly classified training samples by increasing $\|\bz\|_2$. This is due to the fact that the predicted class depends only on $\tilde{\bz} = \bz/\|\bz\|_2$, but the training loss on correctly classified training samples can still be decreased by increasing $\|\bz\|_2$ while keeping $\tilde{\bz}$ fixed. Thus, the model would become overconfident on those samples, since increasing $\|\bz\|_2$ also increases the confidence (as measured by MSP) of the predicted class. This overconfidence phenomenon was confirmed experimentally in \citep{wei_mitigating_2022} by observing that the average magnitude of logits (and therefore also their average 2-norm) tends to increase during training. For this reason, \citet{wei_mitigating_2022} proposed logit 2-norm normalization during training, as a way to mitigate overconfidence. However, during inference, they still used the raw MSP as confidence estimator, without any normalization.

\textbf{Post-training logit normalization.} Here, we propose to use logit $p$-norm normalization as a post-hoc method and we intuitively expected it to have a similar effect in combating overconfidence. (Note that the argument in \citep{wei_mitigating_2022} holds unchanged for any $p$, as nothing in their analysis requires $p = 2$.) Our initial hypothesis was the following: if the model has become too overconfident (through high logit norm) on certain input regions, then---since overconfidence is a form of (loss) overfitting---there would be an increased chance that the model will produce incorrect predictions on the test set along these input regions. Thus, high logit norm on the test set would indicate regions of higher inaccuracy, so that, by applying logit normalization, we would be penalizing likely inaccurate predictions, improving selective classification performance. However, this hypothesis was \textit{disproved} by the experimental results in Appendix~\ref{appendix:data_efficiency}, which show that overconfidence is \textit{not} necessarily a problem for selective classification, but \textit{underconfidence} may be.

Nevertheless, it should be clear that, despite their similarities, logit 2-norm normalization during training and post-hoc logit $p$-norm normalization are different techniques applied to different problems and with different behavior. Moreover, even if logit normalization during training turns out to be beneficial to selective classification (an evaluation that is, however, outside the scope of this work), it should be emphasized that post-hoc optimization can be easily applied on top of any trained model without requiring modifications to its training regime.

\textbf{Combating underconfidence with temperature scaling.} If a model is underconfident on a set of samples, with low logit norm and an MSP value smaller than its expected accuracy on these samples, then the MSP may not provide a good estimate of confidence. 
One particular case of underconfidence is when the model incorrectly attributes too much posterior probability mass to the least probable classes (e.g., when all classes except the predicted one have the same probability). In this case, LogitsMargin (the margin between the highest and the second highest logit), which effectively disregards all logits except the highest two, may be a better confidence estimator.
Alternatively, one may use MSP-TS with a low temperature, which approximates LogitsMargin, as can be easily seen below. Let $\bz = (z_1, \ldots, z_C)$, with $z_1 > \ldots > z_C$. Then
\begin{align}
\text{MSP}(\bz/T) &= \frac{e^{z_1/T}}{\sum_j e^{z_j/T}} = \frac{1}{1 + e^{(z_2 - z_1)/T} + \sum_{j>2} e^{(z_j - z_1)/T}} \label{eq:msp-ts} \\
&= \frac{1}{1 + e^{-(z_1 - z_2)/T}\left(1 + \sum_{j>2} e^{-(z_2 - z_j)/T} \right)} \approx \frac{1}{1 + e^{-(z_1 - z_2)/T}}
\end{align}
for small $T>0$. The right-hand side is a strictly increasing function of the margin $z_1 - z_2$, i.e., of LogitsMargin, and a strictly increasing transformation does not change the ordering of confidence values and thus maintains selective classification performance. This helps explain why TS (with $T < 1$) can improve selective classification performance, as already observed in \citep{galil_what_2023}.
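
To make the quality of this approximation concrete, note that, since $z_2 - z_j \geq z_2 - z_3 > 0$ for all $j > 2$, the factor in parentheses in the last denominator satisfies
\[
1 \leq 1 + \sum_{j>2} e^{-(z_2 - z_j)/T} \leq 1 + (C-2)\, e^{-(z_2 - z_3)/T},
\]
which tends to 1 as $T \to 0$. As a sketch with the hypothetical logits $\bz = (2, 1, 0)$: for $T = 1$ this factor equals $1 + e^{-1} \approx 1.368$, giving $\text{MSP}(\bz) \approx 0.665$ against $1/(1 + e^{-1}) \approx 0.731$ for the two-logit approximation, whereas for $T = 0.1$ the factor equals $1 + e^{-10} \approx 1.00005$ and the two expressions differ by less than $10^{-8}$.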

\textbf{Logit $p$-norm normalization as temperature scaling.} To shed light on why post-hoc logit $p$-norm normalization (with a general $p$) may be helpful, we can show that it is closely related to MSP-TS. Let $g_p(\bz) = z_1/\|\bz\|_p$ denote MaxLogit-pNorm without centralization, which we denote here as MaxLogit-pNorm-NC. Then
\begin{equation}
\text{MSP}(\bz/T) = \left(\frac{e^{z_1}}{\left(\sum_j e^{z_j/T}\right)^T}\right)^{1/T} = \left(\frac{e^{z_1}}{\|e^{\bz}\|_{1/T}}\right)^{1/T} = \left(g_{1/T}(e^{\bz})\right)^{1/T}.
\end{equation}
Thus, MSP-TS is equivalent to MaxLogit-pNorm-NC with $p=1/T$ applied to the transformed logit vector $\exp(\bz)$. This helps explain why a general $p$-norm normalization is useful, as it is closely related to TS, emphasizing the largest components of the logit vector. This also implies that any benefits of MaxLogit-pNorm-NC over MSP-TS stem from \textit{not} applying the exponential transformation of logits. %Why this happens to be useful is still elusive at this point.
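
As a small sanity check of this identity (with hypothetical values), let $C = 2$ and $\bz = (\ln 2, 0)$, so that $e^{\bz} = (2, 1)$, and let $T = 1/2$, i.e., $p = 2$. Then
\[
\text{MSP}(\bz/T) = \frac{e^{2\ln 2}}{e^{2 \ln 2} + e^{0}} = \frac{4}{5}, \qquad \left(g_{2}(e^{\bz})\right)^{2} = \left(\frac{2}{\sqrt{2^2 + 1^2}}\right)^{2} = \frac{4}{5},
\]
so both sides agree, as expected.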

\textbf{Logit $p$-norm normalization goes beyond temperature scaling in combating underconfidence.}
To understand why not applying this exponential transformation is beneficial,
%To further investigate the advantages of MaxLogit-pNorm over MSP-TS, we will first examine the role of logit centralization in the former method. As demonstrated in \autoref{appendix:centralization}, models that benefit from centralization are those producing logits with a mean significantly different from zero. To understand this phenomenon, 
we first express MaxLogit-pNorm-NC as
\begin{equation}
\label{eq:maxlogit-pnorm-nc}
\text{MaxLogit-pNorm-NC}(\mathbf{z}) = \frac{z_1}{\left(\sum_{j=1}^C |z_j|^p\right)^{1/p}}
= \frac{1}{\left(\sum_{j=1}^C \left|\frac{z_j}{z_1}\right|^p\right)^{1/p}}
= \left(\frac{1}{1+\left|\frac{z_2}{z_1}\right|^p+\sum_{j=3}^C \left|\frac{z_j}{z_1}\right|^p}\right)^{1/p}
\end{equation}
where we assume $z_1 > 0$. Now, suppose that the logits already happen to be centralized (which also ensures $z_1 > 0$). It follows that most of the logits $z_j$ for $j \gg 1$ are close to zero (except possibly the very last ones). Thus, under the summation in \eqref{eq:maxlogit-pnorm-nc}, these logits effectively disappear, which is particularly useful in the case of underconfidence discussed above. However, this would not happen if an exponential transformation were applied to the logits as in \eqref{eq:msp-ts}, unless $T$ is very small. On the other hand, making $T$ too small can lead to ignoring not only the smallest logits but also some of the larger ones, i.e., it may be too drastic a measure. These effects are illustrated in Fig.~\ref{fig:fj/f2}.
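
To illustrate with hypothetical numbers (not taken from any of the evaluated models), consider $C = 1000$ approximately centralized logits (mean $0.015$) with $z_1 = 10$, $z_2 = 5$ and the remaining 998 logits equal to $\pm 0.5$ (half of each sign). Comparing the total contribution of the logits with $j \geq 3$ against that of $z_2$ in the denominators of \eqref{eq:maxlogit-pnorm-nc} (with $p = 4$) and \eqref{eq:msp-ts} (with $T = 1$) gives
\begin{align*}
\frac{\sum_{j \geq 3} |z_j/z_1|^p}{|z_2/z_1|^p} &= \frac{998 \cdot 0.05^4}{0.5^4} \approx 0.10, \\
\frac{\sum_{j \geq 3} e^{(z_j - z_1)/T}}{e^{(z_2 - z_1)/T}} &= \frac{499\,(e^{-9.5} + e^{-10.5})}{e^{-5}} \approx 7.6.
\end{align*}
In this example, the small logits are negligible under $p$-norm normalization, but their combined contribution exceeds that of the second-largest logit under the softmax.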

This analysis also helps explain why centralization is useful. As shown in \autoref{appendix:centralization}, for most models, the logits are already centralized, so MaxLogit-pNorm-NC already provides the highest gains. A few models, however, have logits with means significantly different from zero and precisely these models achieve significant gains when centralization is applied, which enables the above analysis to hold.

In summary, underconfidence can be mitigated by prioritizing the largest logits. MaxLogit-pNorm does this by increasing $p$ (which is akin to lowering the temperature), by making most of the smallest logits close to zero via centralization (if needed), and by \textit{not} using an exponential transformation, which allows these near-zero logits to be effectively discarded without penalizing the largest logits.



\begin{figure}[!htb]
\centering
\begin{subfigure}[h]{0.5\textwidth}
\centering
\includegraphics[width=\textwidth]{figs/fj_f2_20_efficientnet_b3.pdf}
\caption{Largest logits}
\end{subfigure}\hfill
\begin{subfigure}[h]{0.5\textwidth}
\centering
\includegraphics[width=0.98\textwidth]{figs/fj_f2_800_efficientnet_b3.pdf}
\caption{Smallest logits}
\end{subfigure}

\caption{The ratio $A_j / A_2$, where $A_j = \exp(z_j/T)$ for MSP-TS and $A_j = |z_j-\mu(\bz)|^p$ for MaxLogit-pNorm, hence reflecting the influence of intermediate logits on Equations \ref{eq:msp-ts} and \ref{eq:maxlogit-pnorm-nc}. The classifier is EfficientNet-B3 evaluated on ImageNet. The sum $\sum_{j \geq 100} A_j / A_2$ is equal to 2.020 for the MSP, 0.005 for the MSP-TS and 0.024 for the MaxLogit-pNorm, showing the effectiveness of the latter two methods in discarding the smallest logits.}
 \label{fig:fj/f2}
\end{figure}



\section{More details and results on the experiments on ImageNet}
\label{appendix:more_results}

\subsection{Hyperparameter optimization of post-hoc methods}

Because it is not differentiable, the NAURC metric demands zero-order optimization. For this work, the optimizations of $p$ and $T$ were conducted via grid search. Note that, as $p$ approaches infinity, $\|\bz\|_p \to \max_j |z_j|$, and in practice this convergence is reasonably quick. Thus, the grid search on $p$ can be restricted to small values of $p$. In our experiments, we noticed that it suffices to evaluate a few values of $p$, such as the integers between 0 and 10, where the 0-norm is taken here to mean the sum of all nonzero values of the vector. The temperature values were taken from the range between 0.01 and 3, with a step size of 0.01, as this proved sufficient for reaching the optimal temperature for selective classification (in general between 0 and 1).
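
The convergence of the $p$-norm can be seen from the standard bound, valid for $p \geq 1$,
\[
\max_j |z_j| \leq \|\bz\|_p \leq C^{1/p} \max_j |z_j|,
\]
and the gap is typically much smaller when few components are close to the maximum in magnitude. For instance, for the hypothetical vector $\bz = (3, 1, 1)$,
\[
\|\bz\|_2 = \sqrt{11} \approx 3.317, \qquad \|\bz\|_5 = 245^{1/5} \approx 3.005, \qquad \|\bz\|_{10} = 59051^{1/10} \approx 3.00001,
\]
compared with $\max_j |z_j| = 3$.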


\subsection{AUROC results}
Table~\ref{tab:results_efficientnet_vgg_auroc} shows the AUROC results for all methods for an EfficientNetV2-XL and a VGG-16 on ImageNet, and \autoref{fig:auroc-models} shows the correlation between the AUROC and the accuracy. As can be seen, the results are consistent with the ones for NAURC presented in Section~\ref{section:results}.

\begin{table*}[h]
\caption{AUROC (mean {\footnotesize $\pm$std}) for post-hoc methods applied to ImageNet classifiers
}
\label{tab:results_efficientnet_vgg_auroc}
\centering
\begin{tabular}{llcccc}
\toprule
& &\multicolumn{4}{c}{Logit Transformation} \\
\cmidrule(r){3-6}
Classifier & Conf. Estimator &    Raw & TS-NLL &  TS-AURC &  pNorm \\
\midrule \multirow{6}{*}{EfficientNet-V2-XL}
& MSP & 0.7732 {\footnotesize $\pm$0.0014} & 0.8107 {\footnotesize $\pm$0.0016} & 0.8606 {\footnotesize $\pm$0.0011} & 0.8712 {\footnotesize $\pm$0.0012} \\
& SoftmaxMargin & 0.7990 {\footnotesize $\pm$0.0013} & 0.8245 {\footnotesize $\pm$0.0014} & 0.8603 {\footnotesize $\pm$0.0012} & 0.8712 {\footnotesize $\pm$0.0011} \\
& MaxLogit & 0.6346 {\footnotesize $\pm$0.0014} & - & - & \textbf{0.8740} {\footnotesize $\pm$0.0010} \\
& LogitsMargin & 0.8604 {\footnotesize $\pm$0.0011} & - & - & 0.8702 {\footnotesize $\pm$0.0010} \\
& NegativeEntropy & 0.6890 {\footnotesize $\pm$0.0014} & 0.7704 {\footnotesize $\pm$0.0026} & 0.6829 {\footnotesize $\pm$0.0891} & 0.8719 {\footnotesize $\pm$0.0016} \\
& NegativeGini & 0.7668 {\footnotesize $\pm$0.0014} & 0.8099 {\footnotesize $\pm$0.0017} & 0.8606 {\footnotesize $\pm$0.0011} & 0.8714 {\footnotesize $\pm$0.0012} \\
\midrule \multirow{6}{*}{VGG16}
& MSP & 0.8660 {\footnotesize $\pm$0.0004} & 0.8652 {\footnotesize $\pm$0.0003} & 0.8661 {\footnotesize $\pm$0.0004} & 0.8661 {\footnotesize $\pm$0.0004} \\
& SoftmaxMargin & 0.8602 {\footnotesize $\pm$0.0003} & 0.8609 {\footnotesize $\pm$0.0004} & 0.8616 {\footnotesize $\pm$0.0003} & 0.8616 {\footnotesize $\pm$0.0003} \\
& MaxLogit & 0.7883 {\footnotesize $\pm$0.0004} & - & - & 0.8552 {\footnotesize $\pm$0.0007} \\
& LogitsMargin & 0.8476 {\footnotesize $\pm$0.0003} & - & - & 0.8476 {\footnotesize $\pm$0.0003} \\
& NegativeEntropy & 0.8555 {\footnotesize $\pm$0.0004} & 0.8493 {\footnotesize $\pm$0.0004} & 0.8657 {\footnotesize $\pm$0.0004} & 0.8657 {\footnotesize $\pm$0.0004} \\
& NegativeGini & 0.8645 {\footnotesize $\pm$0.0004} & 0.8620 {\footnotesize $\pm$0.0003} & 0.8659 {\footnotesize $\pm$0.0003} & 0.8659 {\footnotesize $\pm$0.0003} \\
\bottomrule
\end{tabular}
\end{table*}


\begin{figure}[h]
\centering
\includegraphics[width=0.98\linewidth]{figs/AUROC.pdf}\\[-1ex]

\caption{AUROC of 84 ImageNet classifiers with respect to their accuracy, before and after post-hoc optimization. The baseline plots use MSP, while the optimized plots use MaxLogit-pNorm. The legend shows the optimal value of $p$ for each model, where MSP indicates MSP fallback (no significant positive gain). $\rho$ is the Spearman correlation between the AUROC and the accuracy.}
\label{fig:auroc-models}
\end{figure}
 




\subsection{RC curves}

Figure~\ref{fig:RC_imagenet} shows the RC curves of selected post-hoc methods applied to a few representative models.


\begin{figure}[!htb]
\centering

\begin{subfigure}[b]{0.7\textwidth}
\centering
\includegraphics[width=\textwidth]{figs/RC-efficientnetv2_xl-ImageNet.pdf}
\caption{EfficientNetV2-XL}
\end{subfigure}
\bigskip

\begin{subfigure}[b]{0.7\textwidth}
\centering
\includegraphics[width=\textwidth]{figs/RC-wide_resnet50_2-ImageNet.pdf}
\caption{WideResNet50-2}
\end{subfigure}
\bigskip

\begin{subfigure}[b]{0.7\textwidth}
\centering
\includegraphics[width=\textwidth]{figs/RC-vgg16-ImageNet.pdf}
\caption{VGG16}
\end{subfigure}

\caption{RC curves for selected post-hoc methods applied to ImageNet classifiers.}
 \label{fig:RC_imagenet}
\end{figure}

\subsection{Effect of \texorpdfstring{$\epsilon$}{epsilon}}
\label{appendix:epsilon}

Figure~\ref{fig:epsilon} shows the results (in APG metric) for all methods, as a function of $\epsilon$, when $p$ is optimized. As can be seen, MaxLogit-pNorm is dominant for all $\epsilon > 0$, indicating that, provided the MSP fallback described in Section~\ref{section:fallback} is enabled, it outperforms the other methods.
\begin{figure}
    \centering
    \includegraphics[width=0.6\textwidth]{figs/epsilon.pdf}
    \caption{APG as a function of $\epsilon$}
    \label{fig:epsilon}
\end{figure}


\section{Experiments on additional datasets}
\label{appendix:more_datasets}

\subsection{Experiments on Oxford-IIIT Pet}
\label{appendix:oxford_training}
The hold-out set for Oxford-IIIT Pet, consisting of 500 samples, was taken from the training set before training.
The model used was an EfficientNet-V2-XL pretrained on ImageNet from \citet{wightman_pytorch_2019}. It was fine-tuned on Oxford-IIIT Pet \citep{oxfordpets}. The training was conducted for 100 epochs with cross-entropy loss, using an SGD optimizer with an initial learning rate of 0.1 and a cosine annealing learning rate schedule with period 100. Moreover, a weight decay of 0.0005 and a Nesterov momentum of 0.9 were used. Data transformations were applied, specifically standardization, random crop (to size $224 \times 224$) and random horizontal flip.

Figure~\ref{fig:efficientnetv2_xl-OxfordIIITPet} shows the RC curves for some selected methods for the EfficientNet-V2-XL. As can be seen, considerable gains are obtained with the optimization of $p$, especially in the low-risk region.
\begin{figure}[h]
    \centering
    \includegraphics[width=0.6\textwidth]{figs/RC-efficientnetv2_xl-OxfordIIITPet.pdf}
    \caption{RC curves for an EfficientNet-V2-XL on Oxford-IIIT Pet}
    \label{fig:efficientnetv2_xl-OxfordIIITPet}
\end{figure}

\subsection{Experiments on CIFAR-100}
\label{appendix:cifar_training}
The hold-out set for CIFAR-100, consisting of 5000 samples, was taken from the training set before training.
The model used was forked from \url{github.com/kuangliu/pytorch-cifar}, and adapted for CIFAR-100 \citep{krizhevsky_learning_2009}. It was trained for 200 epochs with cross-entropy loss, using an SGD optimizer with an initial learning rate of 0.1 and a cosine annealing learning rate schedule with period 200. Moreover, a weight decay of 0.0005 and a Nesterov momentum of 0.9 were used. Data transformations were applied, specifically standardization, random crop (to size $32 \times 32$ with padding 4) and random horizontal flip.

Figure~\ref{fig:vgg_19_cifar} shows the RC curves for some selected methods for a VGG19. As can be seen, the results follow the same pattern as those observed for ImageNet, with MaxLogit-pNorm achieving the best results.
\begin{figure}
    \centering
    \includegraphics[width=0.6\textwidth]{figs/RC-VGG_19-Cifar100.pdf}
    \caption{RC curves for a VGG19 for CIFAR-100}
    \label{fig:vgg_19_cifar}
\end{figure}




\section{Data Efficiency}
\label{appendix:data_efficiency}

In this section, we empirically investigate the \textit{data efficiency} \citep{zhang_mix-n-match_2020} of tunable post-hoc methods, which refers to their ability to learn and generalize from limited data. As is well-known from machine learning theory and practice, the more we evaluate the empirical risk to tune a parameter, the more we are prone to overfitting, which is aggravated as the size of the dataset used for tuning decreases. Thus, a method that requires less hyperparameter tuning tends to be more data efficient, i.e., to achieve its optimal performance with less tuning data. We intuitively expect this to be the case for MaxLogit-pNorm, which only requires evaluating a few values of $p$, compared to any method based on the softmax function, which requires tuning a temperature parameter.

As mentioned in Section~\ref{experimental_setup}, the experiments conducted on ImageNet used a test set of 45000 images randomly sampled from the available ImageNet validation dataset, resulting in 5000 images for the tuning set. To evaluate data efficiency, the post-hoc optimization process was executed multiple times, using different fractions of the tuning set while keeping the test set fixed. This whole process was repeated 50 times for different random samplings of the test set (always fixed at 45000 images).

Figure~\ref{fig:data_efficiency-imagenet} displays the outcomes of these studies for a ResNet50 trained on ImageNet. As observed, MaxLogit-pNorm exhibits outstanding data efficiency, while methods that require temperature optimization achieve lower efficiency.

Furthermore, this experiment was conducted on the VGG19 model for CIFAR-100, as shown in Figure~\ref{fig:data_efficiency-cifar}. The same conclusions hold regarding the high data efficiency of MaxLogit-pNorm.

\begin{figure}[!htb]
\centering
\begin{subfigure}[t]{0.7\textwidth}
    \centering
    \includegraphics[width=\textwidth]{figs/DataEfficiency_resnet50_ImageNet.pdf}
    \caption{ResNet50 on ImageNet}
    \label{fig:data_efficiency-imagenet}
\end{subfigure}
\begin{subfigure}[t]{0.7\textwidth}
    \centering
    \includegraphics[width=\textwidth]{figs/DataEfficiency_VGG_19_Cifar100.pdf}
    \caption{VGG19 on CIFAR-100}
    \label{fig:data_efficiency-cifar}
\end{subfigure}
\caption{Mean NAURC as a function of the number of samples used for tuning the confidence estimator. Filled regions for each curve correspond to $\pm 1$ standard deviation (across 50 realizations).
Dashed lines represent the mean of the NAURC achieved when the optimization is made directly on the test set (giving a lower bound on the optimal value), while dotted lines correspond respectively to
$\pm 1$ standard deviation. (a) ResNet50 on ImageNet. For comparison, the MSP achieves a mean NAURC of 0.3209 (not shown in the figure). (b) VGG19 on CIFAR-100.}
\label{fig:data_efficiency}
\end{figure}


To ensure our findings generalize across models, we repeat this process for all the 84 ImageNet classifiers considered, for a specific tuning set size. This time only 10 realizations of the test set were performed, similarly to the results of Section~\ref{sec:comparison-methods}. Table~\ref{tab:APG-1000} is the equivalent of Table~\ref{tab:APG} for a tuning set of 1000 samples, while Table~\ref{tab:APG-500} corresponds to a tuning set of 500 samples. As can be seen, the results obtained are consistent with those observed previously. In particular, MaxLogit-pNorm provides a statistically significant improvement over all other methods when the tuning set is reduced. Moreover, MaxLogit-pNorm is one of the most stable among the tunable methods in terms of variance of gains.


\begin{table*}[h]
\caption{APG-NAURC (mean {\footnotesize $\pm$std}) of post-hoc methods across 84 ImageNet classifiers, for a tuning set of 1000 samples}
\label{tab:APG-1000}
\centering
\begin{tabular}{lcccc}
\toprule
&\multicolumn{4}{c}{Logit Transformation} \\
\cmidrule(r){2-5}
Conf. Estimator &    Raw & TS-NLL &  TS-AURC &  pNorm \\
\midrule
MSP & 0.00000 {\footnotesize $\pm$0.00000} & 0.03657 {\footnotesize $\pm$0.00084} & 0.05571 {\footnotesize $\pm$0.00164} & 0.06436 {\footnotesize $\pm$0.00413} \\
SoftmaxMargin & 0.01951 {\footnotesize $\pm$0.00010} & 0.04102 {\footnotesize $\pm$0.00052} & 0.05420 {\footnotesize $\pm$0.00134} & 0.06238 {\footnotesize $\pm$0.00416} \\
MaxLogit & 0.00000 {\footnotesize $\pm$0.00000} & - & - & \textbf{0.06795} {\footnotesize $\pm$0.00077} \\
LogitsMargin & 0.05510 {\footnotesize $\pm$0.00059} & - & - & 0.06110 {\footnotesize $\pm$0.00084} \\
NegativeEntropy & 0.00000 {\footnotesize $\pm$0.00000} & 0.01566 {\footnotesize $\pm$0.00182} & 0.05851 {\footnotesize $\pm$0.00055} & 0.06485 {\footnotesize $\pm$0.00176} \\
NegativeGini & 0.00000 {\footnotesize $\pm$0.00000} & 0.03627 {\footnotesize $\pm$0.00095} & 0.05617 {\footnotesize $\pm$0.00162} & 0.06424 {\footnotesize $\pm$0.00390} \\

\bottomrule
\end{tabular}
\end{table*}

\begin{table*}[h]
\caption{APG-NAURC (mean {\footnotesize $\pm$std}) of post-hoc methods across 84 ImageNet classifiers, for a tuning set of 500 samples}
\label{tab:APG-500}
\centering
\begin{tabular}{lcccc}
\toprule
&\multicolumn{4}{c}{Logit Transformation} \\
\cmidrule(r){2-5}
Conf. Estimator &    Raw & TS-NLL &  TS-AURC &  pNorm \\
\midrule
MSP & 0.00000 {\footnotesize $\pm$0.00000} & 0.03614 {\footnotesize $\pm$0.00152} & 0.05198 {\footnotesize $\pm$0.00381} & 0.05835 {\footnotesize $\pm$0.00677} \\
SoftmaxMargin & 0.01955 {\footnotesize $\pm$0.00008} & 0.04083 {\footnotesize $\pm$0.00094} & 0.05048 {\footnotesize $\pm$0.00381} & 0.05601 {\footnotesize $\pm$0.00683} \\
MaxLogit & 0.00000 {\footnotesize $\pm$0.00000} & - & - & \textbf{0.06719} {\footnotesize $\pm$0.00141} \\
LogitsMargin & 0.05531 {\footnotesize $\pm$0.00042} & - & - & 0.06064 {\footnotesize $\pm$0.00081} \\
NegativeEntropy & 0.00000 {\footnotesize $\pm$0.00000} & 0.01487 {\footnotesize $\pm$0.00266} & 0.05808 {\footnotesize $\pm$0.00066} & 0.06270 {\footnotesize $\pm$0.00223} \\
NegativeGini & 0.00000 {\footnotesize $\pm$0.00000} & 0.03578 {\footnotesize $\pm$0.00174} & 0.05250 {\footnotesize $\pm$0.00368} & 0.05832 {\footnotesize $\pm$0.00656} \\
\bottomrule
\end{tabular}
\end{table*}





\section{Ablation study on the choice of \texorpdfstring{$p$}{p}}
\label{appendix:ablation}

A natural question regarding $p$-norm normalization (with a general $p$) is whether it can provide any benefits beyond the default $p=2$ used by \citet{wei_mitigating_2022}. Table~\ref{tab:p_ablation} shows the APG-NAURC results for the 84 ImageNet classifiers when different values of $p$ are kept fixed and when $p$ is optimized for each model (tunable).

\begin{table}[h]
\caption{APG-NAURC (mean {\footnotesize $\pm$std}) across 84 ImageNet classifiers, for different values of $p$}
\label{tab:p_ablation}
\centering
\begin{tabular}{lrr}
\toprule
&\multicolumn{2}{c}{Confidence Estimator} \\
\cmidrule(r){2-3}
$p$ & MaxLogit-pNorm & MSP-pNorm \\
\midrule
0& 0.00000 {\footnotesize $\pm$0.00000}& 0.05769 {\footnotesize $\pm$0.00038} \\
1& 0.00199 {\footnotesize $\pm$0.00007}& 0.05990 {\footnotesize $\pm$0.00062} \\
2& 0.01519 {\footnotesize $\pm$0.00050}& 0.06486 {\footnotesize $\pm$0.00054} \\
3& 0.05058 {\footnotesize $\pm$0.00049}& 0.06748 {\footnotesize $\pm$0.00048} \\
4& 0.06443 {\footnotesize $\pm$0.00051}& 0.06823 {\footnotesize $\pm$0.00047} \\
5& 0.06805 {\footnotesize $\pm$0.00048}& 0.06809 {\footnotesize $\pm$0.00048} \\
6& 0.06814 {\footnotesize $\pm$0.00048}& 0.06763 {\footnotesize $\pm$0.00049} \\
7& 0.06692 {\footnotesize $\pm$0.00053}& 0.06727 {\footnotesize $\pm$0.00048} \\
8& 0.06544 {\footnotesize $\pm$0.00048}& 0.06703 {\footnotesize $\pm$0.00048} \\
9& 0.06410 {\footnotesize $\pm$0.00048}& 0.06690 {\footnotesize $\pm$0.00048} \\
Tunable & \textbf{0.06863} {\footnotesize $\pm$0.00045}& 0.06796 {\footnotesize $\pm$0.00051} \\
\bottomrule
\end{tabular}
\end{table}

\begin{table}[h]
\caption{APG-NAURC (mean {\footnotesize $\pm$std}) across 84 ImageNet classifiers, for different values of $p$ for a tuning set of 1000 samples}
\label{tab:p_ablation_1000}
\centering
\begin{tabular}{lrr}
\toprule
&\multicolumn{2}{c}{Confidence Estimator} \\
\cmidrule(r){2-3}
$p$ & MaxLogit-pNorm & MSP-pNorm \\
\midrule
0& 0.00000 {\footnotesize $\pm$0.00000}& 0.05571 {\footnotesize $\pm$0.00164} \\
1& 0.00199 {\footnotesize $\pm$0.00007}& 0.05699 {\footnotesize $\pm$0.00365} \\
2& 0.01519 {\footnotesize $\pm$0.00050}& 0.06234 {\footnotesize $\pm$0.00329} \\
3& 0.05058 {\footnotesize $\pm$0.00049}& 0.06527 {\footnotesize $\pm$0.00340} \\
4& 0.06443 {\footnotesize $\pm$0.00051}& 0.06621 {\footnotesize $\pm$0.00375} \\
5& 0.06805 {\footnotesize $\pm$0.00048}& 0.06625 {\footnotesize $\pm$0.00338} \\
6& \textbf{0.06814} {\footnotesize $\pm$0.00048}& 0.06589 {\footnotesize $\pm$0.00332} \\
7& 0.06692 {\footnotesize $\pm$0.00053}& 0.06551 {\footnotesize $\pm$0.00318} \\
8& 0.06544 {\footnotesize $\pm$0.00048}& 0.06512 {\footnotesize $\pm$0.00345} \\
9& 0.06410 {\footnotesize $\pm$0.00048}& 0.06491 {\footnotesize $\pm$0.00329} \\
Tunable& 0.06795 {\footnotesize $\pm$0.00077}& 0.06436 {\footnotesize $\pm$0.00413} \\
\bottomrule
\end{tabular}
\end{table}

As can be seen, there is a significant benefit of using a larger $p$ (especially a tunable one) compared to simply using $p=2$, especially for MaxLogit-pNorm. Note that, differently from MaxLogit-pNorm, MSP-pNorm requires temperature optimization. This additional tuning is detrimental to data efficiency, which is evidenced by the loss in performance of MSP-pNorm using a tuning set of 1000 samples, as shown in Table~\ref{tab:p_ablation_1000}.


\section{Logits translation}
\label{appendix:centralization}

In \autoref{sec:logit-transformations} we proposed $p$-normalization applied together with the centralization of the logits. In this section, we aim to provide an ablation of this centralization procedure and the effects of the translation of logits.

First of all, it is worth noting that the softmax function is translation invariant, i.e.,
\begin{equation}
    \sigma(\mathbf{z}) = \sigma(\mathbf{z}+\gamma) \quad \forall \gamma\in\mathbb{R}.
\end{equation}

As the usual training loss (i.e., the cross-entropy loss) takes as input only the softmax outputs, the logits after convergence might have arbitrary means/offsets. Moreover, the following properties become relevant when dealing with selective classification:
\begin{itemize}
    \item All methods in which the $p$-normalization is applied on the logits are sensitive to any constant added to them;
    \item Adding the same constant to the logits of all samples does not change the ranking between samples when MaxLogit is used as the confidence estimator. However, when a different constant is added for each sample (as in centralization), the ranking might be affected;
    \item The LogitsMargin is totally insensitive to the translation of the logits;
    \item All methods using softmax \emph{without} $p$-normalization are insensitive to the translation of the logits.
\end{itemize}
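
These properties follow from short calculations, sketched here for concreteness. For any $\gamma \in \mathbb{R}$ and any class $k$,
\[
(\sigma(\mathbf{z} + \gamma))_k = \frac{e^{z_k + \gamma}}{\sum_j e^{z_j + \gamma}} = \frac{e^{\gamma} e^{z_k}}{e^{\gamma} \sum_j e^{z_j}} = (\sigma(\mathbf{z}))_k,
\]
and, with the logits sorted in decreasing order, $(z_1 + \gamma) - (z_2 + \gamma) = z_1 - z_2$. In contrast, $p$-norm normalization is not translation invariant: for the hypothetical logits $\mathbf{z} = (4, 3)$ and $p = 2$,
\[
\frac{z_1}{\|\mathbf{z}\|_2} = \frac{4}{5} = 0.8, \qquad \frac{z_1 - 3}{\|\mathbf{z} - 3\|_2} = \frac{1}{\|(1, 0)\|_2} = 1.
\]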

In order to study the impact of the translation of logits on the MaxLogit-pNorm method, we will start by proposing an alternative post-hoc method:
\begin{equation}
    \text{MaxLogit-pNorm-shift}(\mathbf{z},\Gamma,\gamma) \triangleq \text{MaxLogit}\left(\frac{\mathbf{z}-\Gamma(\mathbf{z})+\gamma}{\|\mathbf{z}-\Gamma(\mathbf{z})+\gamma\|_p}\right),
\end{equation}
where $\Gamma\colon \mathbb{R}^C\to\mathbb{R}$ is a function of the logits (such as the mean function) and $\gamma \in \mathbb{R}$ is a constant to be optimized together with $p$. The optimization of $\gamma$ is performed with a grid search in the range $[-3, 3]$.
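
As a worked instance of this definition (with hypothetical logits, $\gamma = 0$ and $p = 2$), take $\mathbf{z} = (4, 3, 2)$, for which $\mu(\mathbf{z}) = 3$ and $\min_j z_j = 2$. Three natural choices of $\Gamma$ give
\begin{align*}
\Gamma(\mathbf{z}) = 0: &\quad \frac{4}{\|(4, 3, 2)\|_2} = \frac{4}{\sqrt{29}} \approx 0.743, \\
\Gamma(\mathbf{z}) = \mu(\mathbf{z}): &\quad \frac{1}{\|(1, 0, -1)\|_2} = \frac{1}{\sqrt{2}} \approx 0.707, \\
\Gamma(\mathbf{z}) = \min_j z_j: &\quad \frac{2}{\|(2, 1, 0)\|_2} = \frac{2}{\sqrt{5}} \approx 0.894,
\end{align*}
so the same logit vector receives a different confidence value under each choice.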

\autoref{tab:APG-centralization-ablation} shows the APG-NAURC for all 84 models considered in this work on ImageNet when using different possibilities for $\Gamma$ and $\gamma$. Specifically, we considered the cases where $\gamma$ is 0 and where it is optimized on a hold-out set; for $\Gamma$, we considered $\Gamma(\bz) = 0$, $\Gamma(\bz) = \mu(\bz)$ (for centralization) and $\Gamma(\bz) = \min_j z_j$ (to align the minimum value of all samples to 0, making all logits nonnegative).
%: a function identically equal to zero, the centralization ($\Gamma(\mathbf{z}) = \mu(\mathbf{z})$) and to align the minimum value of all samples to 0, making all logits positive ($\Gamma(\mathbf{z}) = \min_i z_i$). 
As can be seen, optimizing $\gamma$ does not provide significant gains and 
%can lead us to the best results, but 
can lead to overfitting in a low data regime; thus, in the main method we discarded this constant.
%. Due to its lower data efficiency and negligible gains, in the main method we discarded this constant. 
For $\gamma = 0$, choosing $\Gamma(\bz) = \mu(\bz)$ provides the highest gains, which, although relatively small compared to $\Gamma(\bz) = 0$, certainly do not harm performance. Since this operation is computationally cheap, does not require optimization, and allows the use of softmax probabilities directly (as mentioned in \autoref{sec:logit-transformations}), we decided to adopt it in the main method.
%Finally, the optimal tested possibility is the simple centralization, as proposed in \autoref{sec:logit-transformations}.

\begin{table}[h]
\caption{APG-NAURC (mean {\footnotesize $\pm$std}) of MaxLogit-pNorm-shift for different selections of $\Gamma$ and $\gamma$. $\gamma^*$ represents the value that optimizes the AURC in the hold-out dataset.}
\label{tab:APG-centralization-ablation}
\centering
\begin{tabular}{lcccc}
\toprule
&\multicolumn{2}{c}{5000 hold-out samples}&\multicolumn{2}{c}{1000 hold-out samples} \\
\cmidrule(r){2-3}
\cmidrule(r){4-5}
$\Gamma(\mathbf{z})$ & $\gamma=0$ & $\gamma = \gamma^*$ & $\gamma=0$ & $\gamma = \gamma^*$\\
\midrule
0 & 0.06833 {\footnotesize $\pm$0.00044} & 0.06866 {\footnotesize $\pm$0.00044} & 0.06760 {\footnotesize $\pm$0.00077} & 0.06738 {\footnotesize $\pm$0.00091} \\

$\mu(\mathbf{z})$ & \textbf{0.06863} {\footnotesize $\pm$0.00045} & \textbf{0.06867} {\footnotesize $\pm$0.00045} & \textbf{0.06795} {\footnotesize $\pm$0.00077} & \textbf{0.06742} {\footnotesize $\pm$0.00093} \\

$\min_j z_j$ & 0.06668 {\footnotesize $\pm$0.00049} & 0.06658 {\footnotesize $\pm$0.00056} & 0.06626 {\footnotesize $\pm$0.00073} & 0.06523 {\footnotesize $\pm$0.00151} \\
\bottomrule
\end{tabular}
\end{table}


% although the gain is small on average, it can be quite significant for specific models
\autoref{fig:centralization_gains} shows the difference in NAURC between $\Gamma(\mathbf{z}) = \mu(\mathbf{z})$ and $\Gamma(\mathbf{z}) = 0$ (for $\gamma=0$), plotted against the mean of the logits averaged across all test samples, for all models for which MaxLogit-pNorm yields gains (i.e., the MSP fallback is not applied). It can be observed that most models already output logits with almost zero mean, making centralization unnecessary. However, a few models with nonzero logit means obtain considerable gains from centralization.

\begin{figure}[!htb]
    \centering
    \includegraphics[width=0.6\textwidth]{figs/LogitsMean_centralizationgain_ImageNet.pdf}
    \caption{Gains in NAURC obtained by centralizing the logits, plotted against the average of all logits in the test dataset. Colors represent the gain of MaxLogit-pNorm over MSP.}
    \label{fig:centralization_gains}
\end{figure}


\section{Comparison with other tunable methods}
\label{appendix:other-methods}

In Section~\ref{sec:comparison-methods} we compared several logit-based confidence estimators obtained by combining a parameterless confidence estimator with a tunable logit transformation, specifically, TS and $p$-norm normalization. In this section, we consider other previously proposed tunable confidence estimators that do not fit into this framework.

Note that some of these methods were originally proposed for calibration, and hence their hyperparameters were tuned to optimize the NLL loss (which is usually suboptimal for selective classification). Instead, to make a fair comparison, we optimized all of their parameters using the AURC as the objective.

\citet{zhang_mix-n-match_2020} proposed \textit{ensemble temperature scaling} (ETS):
\begin{equation}
\text{ETS}(\bz) \triangleq w_1\text{MSP}\left(\frac{\bz}{T}\right) + w_2\text{MSP}(\bz) + w_3\frac{1}{C}
\end{equation}
where $w_1,w_2,w_3 \in \mathbb{R}^+$ are tunable parameters and $T$ is the temperature previously obtained through the temperature scaling method. The grid for both $w_1$ and $w_2$ was $[0,1]$, as suggested by the authors, with a step size of 0.01. The parameter $w_3$ was not considered, since adding a constant to the confidence estimator cannot change the ranking between samples and consequently cannot change the value of selective classification metrics.
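
To make the ranking argument concrete (a short sketch under the definition above, where $\bz$ and $\bz'$ denote the logits of two arbitrary samples, a notation introduced only for this remark), consider the difference
\begin{equation*}
\text{ETS}(\bz) - \text{ETS}(\bz') = w_1\left[\text{MSP}\left(\frac{\bz}{T}\right) - \text{MSP}\left(\frac{\bz'}{T}\right)\right] + w_2\left[\text{MSP}(\bz) - \text{MSP}(\bz')\right].
\end{equation*}
The term $w_3/C$ cancels, so the ordering of any two samples by confidence does not depend on $w_3$.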

\citet{boursinos2022selective} proposed the following confidence estimator, referred to here as \textit{Boursinos-Koutsoukos} (BK):
\begin{equation}
\text{BK}(\bz) \triangleq a \text{MSP}(\bz) + b (1-\max_{k \in \calY: k \neq \hat{y}} \sigma_k(\bz))
\end{equation}
where $a, b \in \mathbb{R}$ are tunable parameters. The grid for both $a$ and $b$ was $[-1,1]$ as suggested by the authors, with a step size of 0.01, although we note that the optimization never found $a < 0$ (probably due to the high value of the MSP as a confidence estimator).

Finally, \citet{balanya_adaptive_2022} proposed \textit{entropy-based temperature scaling} (HTS):
\begin{equation}
\label{hts_balanya}
\text{HTS}(\bz) \triangleq \text{MSP}\left(\frac{\bz}{T_H(\bz)}\right)
\end{equation}
where $T_{H}(\bz) = \log\left(1 + \exp(b + w \log \bar{H}(\bz) ) \right)$, $\bar{H}(\bz) = -(1/C) \sum_{k \in \calY} \sigma_k(\bz) \log \sigma_k(\bz)$,
and $b, w \in\mathbb{R}$ are tunable parameters. The grids for $b$ and $w$ were, respectively, $[-3,1]$ and $[-1,1]$, with a step size of 0.01, and we note that the optimal parameters were always strictly inside the grid.
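
As a quick check of how HTS relates to plain temperature scaling (a sketch based only on the definition above), note that setting $w = 0$ removes the dependence on the entropy, since $\bar{H}(\bz) > 0$ for finite logits:
\begin{equation*}
w = 0 \implies T_H(\bz) = \log\left(1 + e^{b}\right) \triangleq T_0, \qquad \text{HTS}(\bz) = \text{MSP}\left(\frac{\bz}{T_0}\right),
\end{equation*}
where $T_0 > 0$ is a constant temperature (the symbol $T_0$ is introduced here only for illustration). In this special case HTS coincides with MSP applied to temperature-scaled logits.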

The results for these post-hoc methods are shown in Table~\ref{tab:extra_methods} and Table~\ref{tab:extra_methods_1000}. Interestingly, BK, which can be seen as a tunable linear combination of MSP and SoftmaxMargin, is able to outperform both of them, although it still underperforms MSP-TS. On the other hand, ETS, which is a tunable linear combination of MSP and MSP-TS, attains exactly the same performance as MSP-TS. Finally, HTS, which is a generalization of MSP-TS, is able to outperform it, although it still underperforms most methods that use $p$-norm tuning (see Table~\ref{tab:APG}). In particular, MaxLogit-pNorm shows superior performance to all of these methods, while requiring much less hyperparameter tuning.
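
The relation between BK and SoftmaxMargin mentioned above can be made explicit with a short derivation, assuming that SoftmaxMargin is defined as $\text{SoftmaxMargin}(\bz) = \sigma_{\hat{y}}(\bz) - \max_{k \in \calY: k \neq \hat{y}} \sigma_k(\bz)$ (this definition is restated here as an assumption, since it does not appear in this appendix). Substituting $\max_{k \neq \hat{y}} \sigma_k(\bz) = \text{MSP}(\bz) - \text{SoftmaxMargin}(\bz)$ into the definition of BK gives
\begin{equation*}
\text{BK}(\bz) = a\,\text{MSP}(\bz) + b\left(1 - \text{MSP}(\bz) + \text{SoftmaxMargin}(\bz)\right) = (a - b)\,\text{MSP}(\bz) + b\,\text{SoftmaxMargin}(\bz) + b.
\end{equation*}
Since the additive constant $b$ does not affect the ranking of samples, the choice $b = 0$, $a > 0$ yields the same ranking as MSP, while $a = b > 0$ yields the same ranking as SoftmaxMargin.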

\begin{table}[h]
\centering
\caption{APG-NAURC of additional tunable post-hoc methods across 84 ImageNet classifiers}
\label{tab:extra_methods}
\begin{tabular}{cc}
\toprule
Method& APG-NAURC \\
\midrule
BK & 0.03932 {\footnotesize $\pm$0.00031} \\
ETS & 0.05768 {\footnotesize $\pm$0.00037} \\
HTS & 0.06309 {\footnotesize $\pm$0.00034} \\
MaxLogit-pNorm & \textbf{0.06863} {\footnotesize $\pm$0.00045} \\
\bottomrule
\end{tabular}
\end{table}


\begin{table}[h]
\centering
\caption{APG-NAURC of additional tunable post-hoc methods across 84 ImageNet classifiers for a tuning set with 1000 samples}
\label{tab:extra_methods_1000}
\begin{tabular}{cc}
\toprule
Method& APG-NAURC \\
\midrule
BK & 0.03795 {\footnotesize $\pm$0.00067} \\
ETS & 0.05569 {\footnotesize $\pm$0.00165} \\
HTS & 0.05927 {\footnotesize $\pm$0.00280} \\
MaxLogit-pNorm & \textbf{0.06795} {\footnotesize $\pm$0.00077} \\
\bottomrule
\end{tabular}
\end{table}


Methods with a larger number of tunable parameters, such as PTS \citep{tomani_parameterized_2022} and HnLTS \citep{balanya_adaptive_2022}, are only viable with a differentiable loss. As these methods were proposed for calibration, the NLL loss is used; however, since previous works have shown that optimizing it does not always improve, and sometimes even harms, selective classification \citep{zhu_rethinking_2023,galil_what_2023}, these methods were not considered in our work. The investigation of alternative methods for optimizing selective classification (such as proposing differentiable losses or more efficient zero-order methods) is left as a suggestion for future work.
In any case, note that using a large number of hyperparameters is likely to harm data efficiency.

We also evaluated additional parameterless confidence estimators proposed for selective classification \citep{hasan2023survey_posthoc}, such as LDAM \citep{he2011rejection} and the method of \citet{leon2018new}, both in their raw form and with TS/pNorm optimization, but none of these methods showed any gain over the MSP. Note that the Gini index, sometimes proposed as a post-hoc method \citep{hasan2023survey_posthoc} (and also known as \textsc{Doctor}'s $D_\alpha$ method \citep{granese2021doctor}), has already been covered in Section~\ref{section:confidence-estimation}.


\section{Calibration Results}
\label{appendix:investigation}

If the confidence estimate $g(x)$ of a model can be treated as a probability, as is the case with the MSP, it is natural to desire that it truly reflects the probability that the prediction is correct. A model is said to be perfectly \emph{calibrated} if:

\begin{equation}
    \mathbb{P}[\hat{y}=y \mid g(x)=p] = p, \quad \forall p\in [0,1]
\end{equation}

One popular framework to measure calibration in a finite dataset is to use binning. If we group predictions into $M$ interval bins of equal width, and if $B_m$ is the set of indices of samples whose prediction confidence belongs to the interval $\left(\frac{m-1}{M}, \frac{m}{M}\right]$, we calculate the accuracy of bin $B_m$ as:
\begin{equation}
    \text{acc}(B_m) = \frac{1}{|B_m|}\sum_{i\in B_m} \indicator[\hat{y}_i = y_i]
\end{equation}
where $\hat{y}_i$ and $y_i$ are the predicted and the true classes of sample $i$ and $|B_m|$ is the number of samples in the bin. The average confidence of the same bin is calculated as:

\begin{equation}
    \text{conf}(B_m) = \frac{1}{|B_m|} \sum_{i\in B_m} g(x_i)
\end{equation}

From these definitions, and denoting by $n$ the total number of samples, the most popular metric for measuring calibration is the Expected Calibration Error (ECE) \citep{naeini2015obtaining}, defined as:

\begin{equation}
    \text{ECE}(g) \triangleq \sum_{m=1}^M \frac{|B_m|}{n} \left| \text{acc}(B_m) - \text{conf}(B_m)\right|
\end{equation}
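
As a small worked example (with hypothetical numbers chosen purely for illustration, not taken from our experiments), suppose $n = 20$ samples are grouped into $M = 2$ bins. If $B_1$ contains 5 samples with $\text{acc}(B_1) = 0.4$ and $\text{conf}(B_1) = 0.2$, and $B_2$ contains 15 samples with $\text{acc}(B_2) = 0.8$ and $\text{conf}(B_2) = 0.9$, then
\begin{equation*}
    \text{ECE}(g) = \frac{5}{20}\left|0.4 - 0.2\right| + \frac{15}{20}\left|0.8 - 0.9\right| = 0.05 + 0.075 = 0.125.
\end{equation*}
Here the first bin is underconfident and the second is overconfident, yet both contribute positively to the ECE, since the metric only measures the magnitude of the gap in each bin.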

It is important to re-emphasize that calibration and metrics such as ECE are defined in a context where $g(x)$ can be treated as a probability. Hence, we only present results for confidence estimators that have this property or intention. The ECE values for all considered methods (optimized for the AURC) for which $g(x)$ can be considered a probability are presented in \autoref{tab:results_ece}. Additionally,
\autoref{fig:ece-imagenet} shows the reliability diagrams \citep{guo_calibration_2017} of different classifiers on ImageNet. For comparison, since MaxLogit-pNorm can only return values between 0 and 1, we also present its reliability curve in \autoref{fig:ece-imagenet}, even though its values should not be interpreted as probabilities. As can be seen, the models with a ``broken'' selective mechanism (EfficientNetV2-XL and WideResNet50-2) tend to have an underconfident MSP, and, while TS-NLL can minimize the ECE, the MSP variant that optimizes selective classification (MSP-pNorm) can be poorly calibrated, with overconfident predictions.


\begin{table*}[h]
\caption{ECE (mean {\footnotesize $\pm$std}) for post-hoc methods applied to ImageNet classifiers
}
\label{tab:results_ece}
\centering
\begin{tabular}{lc}
\toprule
Method & ECE \\
\midrule
MSP & 0.13060 {\footnotesize $\pm$0.00014} \\
MSP-TS-NLL & \textbf{0.02990} {\footnotesize $\pm$0.00109} \\
MSP-TS-AURC & 0.10395 {\footnotesize $\pm$0.00341} \\
MSP-pNorm & 0.10786 {\footnotesize $\pm$0.04860} \\
%MaxLogit-pNorm & 0.16087 {\footnotesize $\pm$0.00266} \\
\bottomrule
\end{tabular}
\end{table*}


\begin{figure}[h]
\centering
\begin{subfigure}[t]{0.48\textwidth}
    \centering
    \includegraphics[width=\textwidth]{figs/ECE-vgg16.pdf}
    \caption{VGG16}
    \label{fig:ece-vgg}
\end{subfigure}
\begin{subfigure}[t]{0.48\textwidth}
    \centering
    \includegraphics[width=\textwidth]{figs/ECE-wide_resnet50_2.pdf}
    \caption{WideResNet50-2}
    \label{fig:ece-wideresnet}
\end{subfigure}
\begin{subfigure}[t]{\textwidth}
    \centering
    \includegraphics[width=\textwidth]{figs/ECE-efficientnetv2_xl.pdf}
    \caption{EfficientNetV2-XL}
    \label{fig:ece-efficientnet}
\end{subfigure}
\caption{Reliability diagrams of different methods applied to VGG16, WideResNet50-2 and EfficientNetV2-XL on ImageNet. The dashed black line indicates perfect calibration. For MaxLogit-pNorm, we do not present the ECE metric since the output of this method is not treated as a probability.}
\label{fig:ece-imagenet}
\end{figure}

These results go against the natural hypothesis that overconfidence is a major problem in uncertainty estimation of neural networks. Thus, we present further investigations regarding the relation between the selective classification anomaly and the over/underconfidence phenomenon. Figure~\ref{fig:histograms} shows histograms of confidence values for two representative examples of non-improvable and improvable models, with the latter shown before and after post-hoc optimization. Figure~\ref{fig:gains_proportion_confident} shows the NAURC gain over MSP versus the proportion of samples with high MSP for each classifier. As can be seen, highly confident models tend to have a good MSP confidence estimator, while less confident models tend to have a poor confidence estimator that is easily improvable by post-hoc methods, after which the resulting confidence estimator becomes concentrated on high values.


\begin{figure*}[!htb]
\centering
\begin{subfigure}[t]{0.35\linewidth}
\centering
\includegraphics[width=\textwidth]{figs/histogram_msp_vgg16_ImageNet.pdf}
\caption{VGG16 - Baseline}
\end{subfigure}
\centering
\begin{subfigure}[t]{0.35\linewidth}
\centering
\includegraphics[width=\textwidth]{figs/histogram_msp_vgg16_ImageNet.pdf}
\caption{VGG16 after MaxLogit-pNorm optimization (fallback) - NAURC gain = 0}
\end{subfigure}
\begin{subfigure}[t]{0.35\textwidth}
\centering
\includegraphics[width=\linewidth]{figs/histogram_msp_wide_resnet50_2_ImageNet.pdf}
\caption{WideResNet50-2 - Baseline}
\end{subfigure}
\begin{subfigure}[t]{0.35\textwidth}
\centering
\includegraphics[width=\linewidth]{figs/histogram_optimized_wide_resnet50_2_ImageNet.pdf}
\caption{WideResNet50-2 after MaxLogit-pNorm optimization - NAURC gain = 0.02376}
\end{subfigure}
\caption{Histograms of confidence values for VGG16 and WideResNet50-2 before and after post-hoc optimization on ImageNet.}
\label{fig:histograms}
\end{figure*}

\begin{figure}
    \centering
    \includegraphics[width=0.6\textwidth]{figs/msp_proportion_imagenet.pdf}
    \caption{NAURC gain versus the proportion of samples with $\text{MSP} > 0.999$.}
    \label{fig:gains_proportion_confident}
\end{figure}

\section{Full Results on ImageNet}
\label{appendix:results_imagenet}
Table~\ref{tab:results_imagenet_naurc} presents all the NAURC results for the most relevant methods for all the models evaluated on ImageNet, while Table~\ref{tab:results_imagenet_aurc} shows the corresponding AURC results and Table~\ref{tab:results_imagenet_auroc} the corresponding AUROC results. $p^*$ denotes the optimal value of $p$ obtained for the corresponding method, while $p^*=\text{F}$ denotes MSP fallback.

\begin{landscape}
\input{table_naurc.tex}
\end{landscape}

\begin{landscape}
\input{table_aurc.tex}
\end{landscape}

\begin{landscape}
\input{table_auroc.tex}
\end{landscape}



