
In this paper, we addressed the problem of selective multiclass classification for deep neural networks with softmax outputs. Specifically, we considered the design of post-hoc confidence estimators that can be computed directly from the unnormalized logits. We performed an extensive benchmark of more than 20 tunable post-hoc methods across 84 ImageNet classifiers, establishing strong baselines for future research. To allow for a fair comparison, we proposed a normalized version of the AURC metric that is insensitive to the classifier accuracy. 

Our main conclusions are the following: (1) For 58 (69\%) of the models considered, considerable NAURC gains over the MSP can be obtained, in one case achieving a reduction of 0.27 points or about 62\%.
(2) Our proposed method MaxLogit-pNorm (which does not use a softmax function) emerges as a clear winner, providing the highest gains with exceptional data efficiency: tuning its single hyperparameter requires, on average, fewer than one sample per class of hold-out data. These observations are confirmed on additional datasets, and the gains are preserved even under distribution shift.
(3)~After post-hoc optimization, all models with similar accuracy achieve a similar level of confidence estimation performance, even models that were previously shown to be very poor at this task. In particular, the selective classification performance of any classifier becomes almost entirely determined by its accuracy, eliminating the apparent tradeoff between these two goals reported in previous work. 
(4) Selective classification performance itself appears to be robust to distribution shift: although it naturally degrades, the degradation is no larger than what would be expected from the corresponding accuracy drop.

%Two questions naturally emerge from our results, which are left as suggestions for future work. Can better performance be attainable with more complex post-hoc methods under limited (or even unlimited) tuning data? What exact properties of a classifier or training regime make it improvable by post-hoc methods? Our investigation suggests that the issue is related to underconfidence, but a complete explanation is still elusive.

We have also investigated what makes a classifier easily improvable by post-hoc methods and found that the issue is related to underconfidence. The root causes of this underconfidence are currently under investigation and will be the subject of our future work.
