\label{section:background}
\subsection{Selective Classification}

Let $P$ be an unknown distribution over $\calX \times \calY$, where $\calX$ is the input space, $\calY = \{1, \ldots, C\}$ is the label space, and $C$ is the number of classes.
The \textit{risk} of a \textit{classifier} $h: \calX \to \calY$ is $R(h) = E_P[\ell(h(x), y)]$, where $\ell: \calY \times \calY \to \RR^+$ is a loss function such as the 0/1 loss $\ell(\hat{y}, y) = \indicator[\hat{y} \neq y]$, with $\indicator[\cdot]$ denoting the indicator function. A \textit{selective classifier} \citep{geifman_selective_2017} is a pair $(h, g)$, where $h$ is a classifier and $g: \calX \to \RR$ is a \textit{confidence estimator} (also known as \textit{confidence score function} or \textit{confidence-rate function}), which quantifies the model's confidence in its prediction for a given input. For some fixed threshold~$t$, given an input~$x$, the selective model makes a prediction $h(x)$ if $g(x) \geq t$; otherwise, the prediction is rejected. A selective model's \textit{coverage} $\phi(h, g) = P[g(x) \geq t]$ is the probability mass of the selected samples in $\calX$, while its \textit{selective risk} $R(h,g) = E_P[\ell(h(x), y) \mid g(x) \geq t]$ is its risk restricted to the selected samples.
%
%A \textit{classifier} is a prediction function $h: \calX \to \calY$. The classifier's (true) \textit{risk} is $R(h) = E_P[\ell(h(x), y)]$, where $\ell: \calY \times \calY \to \RR^+$ is a given loss function, for instance, the 0/1 loss $\ell(\hat{y}, y) = \indicator[\hat{y} \neq y]$, where $\indicator[\cdot]$ denotes the indicator function. A \textit{selective classifier} \citep{geifman_selective_2017} is a pair $(h, g)$, where $h$ is a classifier and $g: \calX \to \RR$ is a \textit{confidence estimator} (also known as \textit{confidence score function} or \textit{confidence-rate function}), which quantifies the model's confidence on its prediction for a given input. For some fixed threshold~$t$, given an input $x$, the selective model makes a prediction $h(x)$ if $g(x) \geq t$, otherwise it abstains from making a prediction. We say that $x$ is \textit{selected} in the former case and \textit{rejected} in the latter. A selective model's \textit{coverage} $\phi(h, g) = P[g(x) \geq t]$ is the probability mass of the selected samples in $\calX$, while its \textit{selective risk} $R(h,g) = E_P[\ell(h(x), y) \mid g(x) \geq t]$ is its risk restricted to the selected samples. 
%
In particular, a model's risk equals its selective risk at \textit{full coverage} (i.e., for $t$ such that $\phi(h, g) = 1$). These quantities can be evaluated empirically given a test dataset $\{(x_i, y_i)\}_{i=1}^N$ drawn i.i.d.\ from~$P$, yielding the \textit{empirical coverage} $\hat{\phi}(h, g) = (1/N)\sum_{i=1}^N \indicator[g(x_i) \geq t]$ and the \textit{empirical selective risk}
\begin{equation}
\label{selective_risk}
\hat{R}(h,g) = \frac{\sum_{i=1}^N\ell(h(x_i), y_i)\indicator[g(x_i)\geq t]}{\sum_{i=1}^N \indicator[g(x_i)\geq t]}.
\end{equation}
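As a purely illustrative example with hypothetical numbers (not taken from any experiment), consider $N = 5$ test samples under the 0/1 loss, with confidences $g(x_i)$ equal to $(0.9, 0.8, 0.7, 0.6, 0.5)$ and losses $\ell(h(x_i), y_i)$ equal to $(0, 0, 1, 0, 1)$ for $i = 1, \ldots, 5$. For $t = 0.65$, only the first three samples are selected, so that
\begin{equation*}
\hat{\phi}(h, g) = \frac{3}{5} = 0.6, \qquad \hat{R}(h,g) = \frac{0 + 0 + 1}{3} \approx 0.333,
\end{equation*}
whereas at full coverage (any $t \leq 0.5$) the empirical risk is $2/5 = 0.4$.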
Note that, by varying $t$, it is generally possible to trade off coverage for selective risk, i.e., a lower selective risk can usually (but not necessarily always) be achieved if more samples are rejected. This tradeoff is captured by the \textit{risk-coverage (RC) curve} \citep{geifman_selective_2017}, a plot of $\hat{R}(h,g)$ as a function of $\hat{\phi}(h, g)$. While the RC curve provides a full picture of the performance of a selective classifier, it is convenient to have a scalar metric that summarizes this curve. A commonly used metric is the \textit{area under the RC curve} (AURC) \citep{ding_revisiting_2020, geifman_bias-reduced_2019}, denoted by $\text{AURC}(h, g)$. %The lower, the better.
However, when comparing selective models, if two RC curves cross, then each model may have a better selective performance than the other depending on the operating point chosen, which cannot be captured by the AURC.
Another interesting metric, which forces the choice of an operating point, is the \textit{selective accuracy constraint} (SAC) \citep{galil_what_2023}, defined as the maximum coverage allowed for a model to achieve a specified accuracy.
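To make these notions concrete, consider a hypothetical toy example (illustrative numbers only) with $N = 5$ test samples, sorted by decreasing confidence, whose 0/1 losses are $(0, 0, 1, 0, 1)$. Assuming distinct confidence values, setting $t$ to each of them in turn selects the $k$ most confident samples and gives the following points of the RC curve:
\begin{center}
\begin{tabular}{c|ccccc}
Selected samples $k$ & 1 & 2 & 3 & 4 & 5 \\
\hline
Coverage $\hat{\phi}(h,g)$ & 0.2 & 0.4 & 0.6 & 0.8 & 1.0 \\
Selective risk $\hat{R}(h,g)$ & $0$ & $0$ & $1/3$ & $1/4$ & $2/5$ \\
\end{tabular}
\end{center}
In this example the selective risk drops from $1/3$ to $1/4$ as the coverage grows from $0.6$ to $0.8$, which illustrates that rejecting more samples does not necessarily lower the selective risk. Reading the SAC off this table under the definition above, a target accuracy of $75\%$ would give a SAC of $0.8$, the largest coverage at which the selective accuracy $1 - \hat{R}(h,g)$ is at least $0.75$, while a target accuracy of $90\%$ would give a SAC of $0.4$.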

Closely related to selective classification is misclassification detection \citep{hendrycks_baseline_2018}, which refers to the problem of discriminating between correct and incorrect predictions made by a classifier. Both tasks rely on ranking predictions according to their confidence estimates, where correct predictions should ideally be separated from incorrect ones. A usual metric for misclassification detection is the area under the ROC curve (AUROC) \citep{fawcett_introduction_2006}, which, in contrast to the AURC, is blind to the classifier performance, focusing only on the quality of the confidence estimates. Thus, it may also be used to evaluate confidence estimators for selective classification \citep{galil_what_2023}.
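
For intuition, the AUROC for misclassification detection can be read as the probability that a randomly drawn correct prediction receives a higher confidence than a randomly drawn incorrect one, with ties counted as one half. As a hypothetical illustration (the numbers are chosen here for convenience), suppose that the correct predictions have confidences $\{0.9, 0.8, 0.6\}$ and the incorrect ones have confidences $\{0.7, 0.5\}$. Of the $3 \times 2 = 6$ (correct, incorrect) pairs, only the pair $(0.6, 0.7)$ is ordered wrongly, so that
\begin{equation*}
\text{AUROC} = \frac{5}{6} \approx 0.833.
\end{equation*}
This value depends only on how the confidences of the two groups are ordered relative to each other, in line with the remark above that the metric reflects the quality of the confidence estimates.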

%Misclassification detection \citep{hendrycks_baseline_2018}, which refers to the problem of discriminating between correct and incorrect predictions made by a classifier, is closely related to selective classification. Both tasks rely on ranking predictions according to their confidence estimates, where correct predictions should be ideally separated from incorrect ones. More precisely, if $(x_1, y_1), (x_2, y_2) \in \calX \times \calY$ are such that $\ell(h(x_1), y_1) > \ell(h(x_2), y_2)$, then we would like to have $g(x_1) < g(x_2)$, i.e., an optimal $g$ orders samples in decreasing order of their losses. In the case of the 0/1 loss, a natural metric of ranking performance \citep{galil_what_2023} is the area under the ROC curve (AUROC) \citep{fawcett_introduction_2006} for misclassification detection. This metric is blind to the classifier performance and focuses exclusively on the quality of the confidence estimates, i.e., for a given classifier $h$, different confidence estimators $g$ can be compared in their ranking performance. Thus, misclassification detection can also be seen as a proxy problem on which to evaluate confidence estimators for selective classification.


%\subsection{Calibration}
%\label{section:calibration}
%Consider a classifier $h: \calX \to \calY$ and a confidence estimator $\pi: \calX \to [0,1]$ (which need not be the same function as the confidence estimator $g$ used for selective classification). We say that $\pi$ is \textit{perfectly calibrated} \citep{guo_calibration_2017, gawlikowski_survey_2022} if
%\begin{equation}
%P[h(x) = y \mid \pi(x) = p] = p, \quad \forall p \in [0, 1], \quad (x, y) \sim P.
%\end{equation}
%In practice, empirical measures of calibration are used, based on a test dataset $\{(x_i, y_i)\}_{i=1}^N$ drawn i.i.d.\ from~$P$. The most popular one is arguably the \textit{expected calibration error} (ECE) \citep{naeini_obtaining_2015}, which is computed by grouping predictions into $M$ equal-sized interval bins $B_m = \{i \in \{1, \ldots, N\}: \pi(x_i) \in (\frac{m-1}{M}, \frac{m}{M}]\}$, $m=1,\ldots,M$, and then taking a weighted average of the difference between accuracy and confidence in each bin:
%\begin{equation}
%\label{ECE}
%\text{ECE} = \sum_{m=1}^M \frac{|B_m|}{N} \left| \text{acc}(B_m) - \text{conf}(B_m)\right|
%\end{equation}
%where $\text{acc}(B_m) = \frac{1}{|B_m|}\sum_{i \in B_m} \indicator[h(x_i) = y_i]$ and $\text{conf}(B_m) = \frac{1}{|B_m|}\sum_{i \in B_m} \pi(x_i)$.



\subsection{Confidence Estimation}
\label{section:confidence-estimation}

From now on we restrict attention to classifiers that can be decomposed as $h(x) = \argmax_{k \in \calY} z_k$, where $\bz = f(x)$ and $f: \calX \to \RR^C$ is a neural network. The network output $\bz$ is referred to as the (vector of) \textit{logits} or \textit{logit vector}, because it is typically passed through a softmax function to obtain an estimate of the posterior distribution $P[y \mid x]$. The softmax function is defined as
\begin{equation}
\label{softmax}
\sigma: \RR^C \to [0,1]^C, \quad \sigma_k(\bz) = \frac{e^{z_k}}{\sum_{j=1}^C e^{z_j}}, \;\; k \in \{1, \ldots, C\}
\end{equation}
where $\sigma_k(\bz)$ denotes the $k$th element of the vector $\sigma(\bz)$. 
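As a quick numerical illustration (the logit vector is an arbitrary choice made here), for $C = 3$ and $\bz = (2, 1, 0)$ we have $\sum_{j=1}^{3} e^{z_j} = e^{2} + e^{1} + e^{0} \approx 11.107$, so that
\begin{equation*}
\sigma(\bz) \approx \left( \frac{7.389}{11.107},\; \frac{2.718}{11.107},\; \frac{1}{11.107} \right) \approx (0.665,\; 0.245,\; 0.090),
\end{equation*}
whose entries sum to one and preserve the ordering of the logits.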

%\purple{In some occasions, one can only have access to the probabilities, and not to the raw logits. Even though the softmax is a not invertible function, the experiments and methods presented in this work work well even in this case, as discussed in more details in \autoref{sec:logit-transformations}.}

The most popular confidence estimator is arguably the \textit{maximum softmax probability} (MSP) \citep{ding_revisiting_2020}, also known as \textit{maximum class probability} \citep{corbiere_confidence_2021} or \textit{softmax response} \citep{geifman_selective_2017}, defined as
\begin{equation}
g(x) = \text{MSP}(\bz) \triangleq \max_{k\in\mathcal{Y}}\, \sigma_k(\bz) = \sigma_{\hat{y}}(\bz)
\end{equation}
where $\hat{y} = \argmax_{k\in\mathcal{Y}} z_k$.
%%The MSP is widely used as a baseline for confidence quantification, both for selective classification %\citep{hendrycks_baseline_2018} and for calibration. %(note that ECE metric defined in Section \ref{section:calibration} is based on the MSP being interpreted as a probability). 
%\blue{As shown by \citet{chow1970optimum} and \citet{franc2023optimal}, if indeed $\sigma_y(\bz) = P[y|x]$ for all $y \in \calY$, then the MSP is the optimal confidence estimator for the 0/1 loss, known in this case as Chow's rule. Thus, in the general case, it emerges as a natural baseline.}
However, other functions of the logits can be considered. Some examples are the \textit{softmax margin} \citep{belghazi_what_2021,lubrano2023simple}, the \textit{max logit} \citep{Hendrycks.etal.2022.Scaling-Out-of-Distribution-Detection}, the \textit{logits margin} \citep{Streeter.2018.Approximation-Algorithms-Cascading,Lebovitz.etal.2023.Efficient-Inference-Model}, the \textit{negative entropy}\footnote{Note that any uncertainty estimator can be used as a confidence estimator by taking its negative.} \citep{belghazi_what_2021}, and the \textit{negative Gini index} \citep{granese2021doctor,Gomes.etal.2022.Simple-Training-Free-Method}, defined, respectively, as%
%\footnote{In the scenario we consider, \textsc{Doctor}'s $D_\alpha$ and $D_\beta$ discriminators \citep{granese2021doctor} are equivalent to the negative Gini index and MSP confidence estimators, respectively, as discussed in more detail in Appendix~\ref{sec:doctor}.}
\begin{align}
\text{SoftmaxMargin}(\bz) &\triangleq \sigma_{\hat{y}}(\bz) - \max_{k \in \calY: k \neq \hat{y}} \sigma_k(\bz) \\
\text{MaxLogit}(\bz) &\triangleq z_{\hat{y}} \\
\text{LogitsMargin}(\bz) &\triangleq z_{\hat{y}} - \max_{k \in \calY: k \neq \hat{y}} z_k \\
\text{NegativeEntropy}(\bz) &\triangleq \sum_{k \in \calY} \sigma_k(\bz) \log \sigma_k(\bz) \\
\text{NegativeGini}(\bz) &\triangleq -1 + \sum_{k \in \calY} \sigma_k(\bz) ^2.
\end{align}
Note that, in the scenario we consider, \textsc{Doctor}'s $D_\alpha$ and $D_\beta$ discriminators \citep{granese2021doctor} are equivalent to the negative Gini index and MSP confidence estimators, respectively, as discussed in more detail in Appendix~\ref{sec:doctor}.
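
As a worked illustration of these definitions (the logit vector is a hypothetical choice, and $\log$ is assumed here to be the natural logarithm), take $C = 3$ and $\bz = (2, 1, 0)$, for which $\hat{y} = 1$ and $\sigma(\bz) \approx (0.6652, 0.2447, 0.0900)$. Then
\begin{align*}
\text{MSP}(\bz) &\approx 0.6652 \\
\text{SoftmaxMargin}(\bz) &\approx 0.6652 - 0.2447 = 0.4205 \\
\text{MaxLogit}(\bz) &= 2 \\
\text{LogitsMargin}(\bz) &= 2 - 1 = 1 \\
\text{NegativeEntropy}(\bz) &\approx -0.8324 \\
\text{NegativeGini}(\bz) &\approx -1 + 0.5105 = -0.4895.
\end{align*}
The estimators take values on different scales, so their outputs are only meaningful for ranking inputs under the same estimator, not for comparison across estimators.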

It is worth mentioning that, as shown by \citet{chow1970optimum} and \citet{franc2023optimal}, if indeed $\sigma_y(\bz) = P[y \mid x]$ for all $y \in \calY$, then the MSP is the optimal confidence estimator for the 0/1 loss, known in this case as Chow's rule. Thus, in the general case, it emerges as a natural baseline.

% \subsection{Temperature Scaling}
% \label{section:TS}

% Temperature scaling (TS) \citep{guo_calibration_2017} is a post-processing method that consists in, for a given trained classifier, transforming the logits as $\mathbf{z'} = \mathbf{z}/T$, before applying the softmax function. The parameter $T$, called the temperature, is then optimized over a hold-out dataset $\{(x_i, y_i)\}_{i=1}^N$ (not used during training of the classifier). An important property of this method is that it does not change the model's predictions. The conventional way of applying TS, as proposed in \citep{guo_calibration_2017} for calibration and referred to here as TS-NLL, consists in optimizing $T$ with respect to the negative log-likelihood (NLL) \citep{murphy_probabilistic_2022}
% \begin{equation}
% \label{NLL}
% \calL = -\sum_{i=1}^N \log \left( (\sigma(\bz_i/T))_{y_i} \right)
% \end{equation}
% where $\bz_i = f(x_i)$. 
