\label{sec:related-work}

Selective prediction is also known as learning with a reject option \citep[see][and references therein]{zhang_survey_2023, Hendrickx2021ML}, where the rejector is usually a thresholded confidence estimator.\footnote{An interesting application is enabling efficient inference with model cascades \citep{Lebovitz.etal.2023.Efficient-Inference-Model}, although the literature on those topics appears disconnected.}
Essentially the same problem is studied under the equivalent terms misclassification detection \citep{hendrycks_baseline_2018}, failure prediction \citep{corbiere_confidence_2021,zhu_rethinking_2023}, and (ordinal) ranking \citep{moon_confidence-aware_2020,galil_what_2023}. Uncertainty estimation is a more general term that encompasses these tasks (where confidence may be taken as negative uncertainty) as well as other tasks where uncertainty might be useful, such as calibration and out-of-distribution (OOD) detection, among others \citep{gawlikowski_survey_2022,abdar_review_2021}. These tasks are generally not aligned: for instance, optimizing for calibration may harm selective classification performance \citep{ding_revisiting_2020,zhu_rethinking_2023,galil_what_2023}. Our focus here is on in-distribution selective classification, although we also study robustness to distribution shift. 

Most approaches to selective classification consider the base model as part of the learning problem \citep{geifman_selectivenet_2019, huang_self-adaptive_2020, liu_deep_2019}, which we refer to as training-based approaches. While such an approach has a theoretical appeal, the fact that it requires retraining a model is a significant practical drawback. Alternatively, one may keep the model fixed and only modify or replace the confidence estimator, which is known as a post-hoc approach. Such an approach is practically appealing and perhaps more realistic, as it does not require retraining. Some papers that follow this approach construct a \textit{meta-model} that feeds on intermediate features of the base model and is trained to predict whether or not the base model is correct on hold-out samples \citep{corbiere_confidence_2021, shen_post-hoc_2022}. However, depending on the size of such a meta-model, its training may still be computationally demanding. 

A popular tool in the uncertainty literature is the use of ensembles \citep{lakshminarayanan_simple_2017,teye_bayesian_2018,ayhan_test-time_nodate}, of which Monte-Carlo dropout \citep{gal_dropout_2016} is a prominent example. While constructing a confidence estimator from ensemble component outputs may be considered post-hoc if the ensemble is already trained, the need to perform multiple inference passes significantly increases the computational burden at test time. Moreover, recent work has found evidence that ensembles may not be fundamentally better at uncertainty estimation, but simply better predictive models \citep{abe_deep_2022,cattelan_performance_2022,xia_usefulness_2022}. Thus, we do not consider ensembles here.

In this work, we focus on simple post-hoc confidence estimators for softmax networks that can be directly computed from the logits. The earliest example of such a post-hoc method used for selective classification in a real-world application seems to be the use of LogitsMargin in \citep{LeCun.etal.1990.Handwritten-Zip-Code}. While potentially suboptimal, such methods are extremely simple to apply on top of any trained classifier and should be a natural choice to try before any more complex technique. In fact, it is not entirely obvious how a training-based approach should be compared to a post-hoc method. For instance, \citet{feng_towards_2023} have found that, for some state-of-the-art training-based approaches to selective classification, \textit{after} the main classifier has been trained with the corresponding technique, better selective classification performance can be obtained by discarding the auxiliary output providing confidence values and simply using the conventional MSP as the confidence estimator. Thus, in this sense, the MSP can be seen as a strong baseline.

Post-hoc methods have been widely considered in the context of calibration, among which the most popular approach is temperature scaling (TS). Applying TS to improve calibration (of the MSP confidence estimator) was originally proposed in \citep{guo_calibration_2017} based on the negative log-likelihood. Optimizing TS for other metrics has been explored in \citep{mukhoti_calibrating_2020,karandikar_soft_2022,clarte_expectation_2023} for calibration and in \citep{liang_enhancing_2020} for OOD detection, but had not been proposed for selective classification. A generalization of TS is adaptive TS (ATS) \citep{balanya_adaptive_2022}, which uses an input-dependent temperature based on logits. The post-hoc methods we consider here can be seen as a special case of ATS, as logit norms may be seen as an input-dependent temperature; however, \citet{balanya_adaptive_2022} investigate a different temperature function and focus on calibration. (For more discussion on this and other post-hoc methods inspired by calibration, please see Appendix~\ref{appendix:other-methods}.) Other logit-based confidence estimators proposed for calibration and OOD detection include \citep{liu2020energy,tomani_parameterized_2022,rahimi_post-hoc_2022,neumann2018relaxed,gonsior_softmax_2022}.
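
As a sketch of this connection (the notation here is assumed for illustration and may differ from that used elsewhere in the paper), let $\mathbf{z}$ denote a $K$-dimensional logit vector and $\sigma$ the softmax function. The MSP under TS with a constant temperature $T > 0$, and its ATS counterpart with an input-dependent temperature $T(\mathbf{z}) > 0$, can be written as
\[
\mathrm{MSP}_{T}(\mathbf{z}) = \max_{k} \sigma_k(\mathbf{z}/T),
\qquad
\mathrm{MSP}_{\mathrm{ATS}}(\mathbf{z}) = \max_{k} \sigma_k\bigl(\mathbf{z}/T(\mathbf{z})\bigr).
\]
Under this notation, dividing the logits by a norm $\|\mathbf{z}\|$ and then by a constant $\tau > 0$ corresponds to the particular choice $T(\mathbf{z}) = \tau \|\mathbf{z}\|$, for $\mathbf{z} \neq \mathbf{0}$.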

Normalizing the logits with the $L_2$ norm before applying the softmax function was used in \citep{kornblith_why_2021} and later proposed and studied in \citep{wei_mitigating_2022} as a training technique (combined with TS) to improve OOD detection and calibration. A variation where the logits are normalized to unit variance was proposed in \citep{jiang_normsoftmax_2023} to accelerate training. In contrast, we propose to use logit normalization as a post-hoc method for selective classification, extend it to a general $p$-norm, consider a tunable~$p$ with AURC as the optimization objective, and allow it to be used with confidence estimators other than the MSP, all of which are new ideas that depart significantly from previous work.
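
For concreteness, a minimal sketch of $p$-norm logit normalization follows; the symbols $\mathbf{z}$, $K$, and $\tau$ are assumed here for illustration and are not necessarily the notation adopted in the rest of the paper. For logits $\mathbf{z} = (z_1, \ldots, z_K)$ and a given $p$, the normalization can be written as
\[
\|\mathbf{z}\|_p = \Bigl( \sum_{k=1}^{K} |z_k|^p \Bigr)^{1/p},
\qquad
\mathbf{z} \mapsto \frac{\mathbf{z}}{\tau \|\mathbf{z}\|_p},
\]
where $\tau > 0$ plays the role of a temperature and the normalized logits are then passed to the chosen confidence estimator. Setting $p = 2$ gives the $L_2$ normalization mentioned above. The unit-variance variant is closely related: the (population) standard deviation of the entries of $\mathbf{z}$ equals $\|\mathbf{z} - \bar{z}\mathbf{1}\|_2 / \sqrt{K}$, with $\bar{z}$ the mean logit, so it amounts to $L_2$ normalization of the mean-centered logits up to a factor of $\sqrt{K}$.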

Benchmarking of models with respect to their selective classification/misclassification detection performance has been done in \citep{galil_what_2023, ding_revisiting_2020}; however, these works mostly consider the MSP as the confidence estimator. In particular, a thorough evaluation of potential post-hoc estimators for selective classification, as done in this work, had not yet appeared in the literature. The work furthest in that direction is the paper by \citet{galil_what_2023}, who empirically evaluated ImageNet classifiers and found that TS-NLL improved selective classification performance for some models but degraded it for others. In the context of calibration, \citet{wang_rethinking_2021} and \citet{ashukha_pitfalls_2021} have argued that models should be compared after simple post-hoc optimizations, since models that appear worse than others can sometimes easily be improved by methods such as TS. Here we advocate and provide further evidence for this approach in the context of selective classification.
