\section{Discussion}
\label{sec:discussion}


Our work has identified two broad classes of trained models (which comprise 31\% and 69\% of our sample, respectively): models for which the MSP is apparently already an optimal confidence estimator, in the sense that it is not improvable by any of the post-hoc methods we evaluated; and models for which the MSP is suboptimal, in which case all of the best methods evaluated produce highly correlated gains.
As a consequence, a few questions naturally arise.

\textbf{Why is the MSP such a strong baseline in many cases but easily improvable in many others?}
As mentioned in Section~\ref{section:confidence-estimation}, the MSP is the optimal confidence estimator if the softmax output provides the exact class-posterior distribution. While this is obviously not the case in general, \textit{if the model is designed and trained to estimate this posterior}, e.g., by minimizing the NLL, then it is unlikely that a better estimate can be found by simple post-hoc optimization. For instance, the optimal temperature parameter could be easily learned during training and, more generally, any beneficial logit transformation would already be made part of the model architecture to maximize performance. However, modern deep learning classifiers are often trained and tuned with the goal of maximizing validation accuracy rather than minimizing validation NLL, resulting in overfitting of the latter. Indeed, this was the explanation offered in \cite{guo_calibration_2017} for the emergence of overconfidence, which motivated their proposal of TS. Similarly, \cite{wei_mitigating_2022} identified a specific mechanism that could cause this overconfidence, namely, an increasing magnitude of logits during training, which motivated their proposal of logit normalization (see Appendix~\ref{appendix:logit-norm} for more details). This suggests two hypotheses: first, that overconfidence is the main cause of poor selective classification performance; and second, that simple post-hoc tuning can easily improve it. While our results clearly confirm the second hypothesis, they actually \textit{disprove} the first, as shown below.

\textbf{What is the cause of poor selective classification performance?}
According to our experiments in Appendix~\ref{appendix:investigation}, models that produce highly confident MSPs tend to have better confidence estimators (in terms of NAURC), while models whose MSP distribution is more balanced tend to be easily improvable by post-hoc optimization---which, in turn, makes the resulting confidence estimator concentrated on highly confident values. In other words, overconfidence is not necessarily a problem for selective classification, but underconfidence may be. 
While the root causes of this underconfidence are currently under investigation, some natural suspects are techniques that create soft labels, such as label smoothing \citep{szegedy2016rethinking} and mixup augmentation \citep{zhang2017mixup}, which are present in modern training recipes
%\footnote{https://pytorch.org/blog/how-to-train-state-of-the-art-models-using-torchvision-latest-primitives/} 
and have already been shown by \citet{zhu_rethinking_2023} to be harmful for misclassification detection. In any case,
our results reinforce the observations in previous works \citep{zhu_rethinking_2023, galil_what_2023} that, except in the special case where an ideal probabilistic model can be found, calibration and selective classification are distinct problems and optimizing one may harm the other. In particular, the method with the best calibration performance (TS-NLL) achieves only small gains in NAURC, while the method with the highest NAURC gains that still delivers probabilities (MSP-pNorm) does not significantly improve calibration and sometimes harms it.

\textbf{Why are the gains of all methods highly correlated? Why does post-hoc logit normalization improve performance at all?}
One particular case of underconfidence is when the model incorrectly attributes too much posterior probability mass to the least probable classes (e.g., when all classes except the predicted one have the same probability). In this case, LogitsMargin, which effectively disregards all logits except the highest two, may be a better confidence estimator. However, as shown in Appendix~\ref{appendix:logit-norm}, MSP-TS with small $T$ approximates LogitsMargin, while MaxLogit-pNorm with $p=1/T$ is closely related to MSP-TS. Thus, all methods combat underconfidence in a similar way, by focusing on the largest logits, and therefore give highly correlated gains. Moreover, this explains why using a sufficiently large $p$ is essential in post-hoc $p$-norm logit normalization.
%
On the other hand, as also shown in Appendix~\ref{appendix:logit-norm}, due to its unique characteristics, MaxLogit-pNorm is even more effective than MSP-TS in combating this particular form of underconfidence, since it can effectively discard the smallest, least reliable logits without penalizing the largest ones.
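
As a rough sketch of why MSP-TS with small $T$ approximates LogitsMargin (the notation here is assumed for illustration and may differ from that of Appendix~\ref{appendix:logit-norm}), let $z_{(1)} \geq z_{(2)} \geq \cdots \geq z_{(C)}$ denote the sorted logits $\mathbf{z}$ of a given sample, where $C$ is the number of classes. The MSP after temperature scaling can be written as
\[
\max_k \, \mathrm{softmax}(\mathbf{z}/T)_k = \frac{1}{1 + \sum_{j=2}^{C} \exp\left(-(z_{(1)} - z_{(j)})/T\right)} .
\]
When $T$ is small relative to the gap $z_{(2)} - z_{(3)}$, the sum is dominated by its $j=2$ term, so the expression is approximately $1/\left(1+\exp\left(-(z_{(1)}-z_{(2)})/T\right)\right)$, which is an increasing function of the margin $z_{(1)} - z_{(2)}$. Since the risk-coverage trade-off depends only on how a confidence estimator ranks the samples, MSP-TS in this regime should behave approximately like LogitsMargin. Similarly, setting $p = 1/T$, the same quantity equals $\left(e^{z_{(1)}} / \|e^{\mathbf{z}}\|_p\right)^p$, with the exponential applied elementwise, i.e., a monotone function of a $p$-norm normalization of the exponentiated logits. This is only one way to see why a $p$-norm normalization with $p = 1/T$ may be related to MSP-TS; the precise relation is the one given in the appendix.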