
\subsection{Tunable Logit Transformations}
\label{sec:logit-transformations}

In this section, we introduce a simple but powerful framework for designing post-hoc confidence estimators for selective classification. The idea is to take any parameter-free logit-based confidence estimator, such as those described in Section~\ref{section:confidence-estimation}, and augment it with a logit transformation parameterized by one or a few hyperparameters, which are then tuned (e.g., via grid search) using
a labeled hold-out dataset not used during training of the classifier (i.e., validation data).
Moreover, the objective function used for this tuning is not a proxy loss but the exact metric that one is interested in optimizing, for instance, AURC or AUROC. This approach forces us to be conservative about the hyperparameter search space, which is important for data efficiency.

\subsubsection{Temperature Scaling}
\label{section:TS}

Originally proposed in the context of post-hoc calibration, temperature scaling (TS) \citep{guo_calibration_2017} transforms the logits as $\bz' = \bz/T$ before applying the softmax function. The parameter $T>0$, called the temperature, is then optimized over hold-out data.

The conventional way of applying TS, proposed by \citet{guo_calibration_2017} for calibration and referred to here as TS-NLL, consists in optimizing~$T$ with respect to the negative log-likelihood (NLL) \citep{murphy_probabilistic_2022}. %$\calL = -\sum_{i=1}^N \log \left( (\sigma(\bz_i/T))_{y_i} \right)$, where $\bz_i = f(x_i)$.
Here, we instead optimize~$T$ with respect to AURC, and the resulting method is referred to as TS-AURC.
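
Written out for the MSP case, the two criteria differ only in the objective evaluated on the hold-out set. In the following sketch, $\bz_i$ and $y_i$ denote the logits and label of the $i$-th of $N$ hold-out samples, $\sigma$ is the softmax function, and $g_T$ is a shorthand (introduced here only for illustration) for the MSP computed on the scaled logits; in practice the minimization would be carried out over a finite grid of temperatures:
\begin{align*}
T_{\text{NLL}} &= \arg\min_{T > 0} \; -\sum_{i=1}^N \log \left( \sigma(\bz_i/T) \right)_{y_i}, \\
T_{\text{AURC}} &= \arg\min_{T > 0} \; \text{AURC}(h, g_T), \qquad g_T(x) = \max_k \, \sigma_k(\bz/T).
\end{align*}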

Note that TS does not affect the ranking of predictions for MaxLogit and LogitsMargin, so it is not applied in these cases.
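
To see why, assume (as in their usual definitions) that MaxLogit is $\max_k z_k$ and that LogitsMargin is the gap between the two largest logits, written here as $z_{(1)} - z_{(2)}$; this order-statistic notation is introduced only for illustration. For any $T > 0$,
\begin{equation*}
\max_k \frac{z_k}{T} = \frac{1}{T} \max_k z_k, \qquad \frac{z_{(1)}}{T} - \frac{z_{(2)}}{T} = \frac{1}{T}\left(z_{(1)} - z_{(2)}\right),
\end{equation*}
so both confidence values are multiplied by the same positive constant $1/T$ for every sample, which leaves their ordering across samples, and hence the risk-coverage curve, unchanged.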



\subsubsection{Logit Normalization}

Inspired by \citet{wei_mitigating_2022}, who show that logit norms are directly related to overconfidence and propose logit normalization during training, we propose logit normalization as a post-hoc method. Additionally, we extend the normalization from the $2$-norm to a general $p$-norm, where $p$ is a tunable hyperparameter, and, similarly to the method proposed by \citet{jiang_normsoftmax_2023}, we propose to \emph{centralize} the logits before normalization.
(For more context on logit normalization, as well as intuition and theoretical justification for our proposed modifications, see Appendix~\ref{appendix:logit-norm}. For an ablation study on the centralization, see Appendix~\ref{appendix:centralization}.)
%
Thus, (centralized) logit $p$-normalization is defined as the operation
%\begin{equation}
%\bz' = \frac{\bz}{\tau \|\bz\|_p}
%\end{equation}
\begin{equation}\label{eq:logit-norm}
\bz' = \frac{\bz - \mu(\bz)}{\tau \|\bz-\mu(\bz)\|_p}
\end{equation}
where $\|\bz\|_p \triangleq (|z_1|^p + \cdots + |z_C|^p)^{1/p}$, $p \in \RR$, is the $p$-norm of $\bz$, $\mu(\bz) = \frac{1}{C}\sum_{j=1}^C z_j$ is the mean of the logits, and $\tau > 0$ is a temperature scaling parameter. Note that, when the softmax function is used, this transformation becomes a form of adaptive TS \citep{balanya_adaptive_2022}, with an input-dependent temperature $\tau \|\bz-\mu(\bz)\|_p$.
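Indeed, the softmax function is invariant to adding a constant to all logits, so writing $T(\bz) = \tau \|\bz-\mu(\bz)\|_p$ (a shorthand used only here) and $\mathbf{1}$ for the all-ones vector gives
\begin{equation*}
\sigma(\bz') = \sigma\!\left( \frac{\bz}{T(\bz)} - \frac{\mu(\bz)}{T(\bz)} \mathbf{1} \right) = \sigma\!\left( \frac{\bz}{T(\bz)} \right).
\end{equation*}
As a small numerical illustration with made-up logits, take $\bz = (3, 1, -1)$, so that $\mu(\bz) = 1$ and $\bz - \mu(\bz) = (2, 0, -2)$. Then $\|\bz - \mu(\bz)\|_p$ equals $4$ for $p=1$ and $2\sqrt{2}$ for $p=2$, and it approaches $2$ as $p \to \infty$, so the effective temperature is $4\tau$, $2\sqrt{2}\,\tau$, and (in the limit) $2\tau$, respectively.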

Logit $p$-normalization introduces two hyperparameters, $p$ and $\tau$, which should be jointly optimized; to do so, we first optimize $\tau$ for each value of $p$ considered and then pick the best value of $p$. This transformation, together with the optimization of $p$ and $\tau$, is here called pNorm.
The optimization metric is always AURC and is therefore omitted from the name of the method.

Note that, when the underlying confidence estimator is MaxLogit or LogitsMargin, the parameter $\tau$ is irrelevant and is ignored.

One key benefit of centralization is that it enables logit $p$-normalization to be applied even if we only have access to the softmax probabilities instead of the original logits. This can be done by computing the logits as $\tilde{\bz} = \log(\sigma(\bz)) = \bz - c$, where $c = \log(\sum_{j=1}^C e^{z_j})$. Then we have
%\begin{equation}
$\tilde{\bz} - \mu(\tilde{\bz}) = \bz - c - \mu(\bz - c) = \bz - \mu(\bz)$
%\end{equation}
from which \eqref{eq:logit-norm} can be computed.
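For instance, with the illustrative (made-up) logits $\bz = (3, 1, -1)$, the log-probabilities are $\tilde{\bz} = (3 - c,\, 1 - c,\, -1 - c)$ with $c = \log(e^{3} + e^{1} + e^{-1})$, whose mean is $\mu(\tilde{\bz}) = 1 - c$. Hence
\begin{equation*}
\tilde{\bz} - \mu(\tilde{\bz}) = (2, 0, -2) = \bz - \mu(\bz),
\end{equation*}
and the unknown constant $c$ cancels without ever being computed.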

%\purple{One key benefit of the centralization is that it enables the possibility of defining the logits as the logarithm of the softmax probabilities, which can be useful when this is the only output one has. Note that, although the softmax function is not invertible, $\log(\sigma_k(\bz)) - \mu(\log(\sigma(\bz))) = z_k - \mu(\bz)$, and hence the normalization leads to the same results.}


\subsection{Evaluation Metrics}

\subsubsection{Normalized AURC}

A common criticism of the AURC metric is that it does not allow for meaningful comparisons across problems \citep{geifman_bias-reduced_2019}. An AURC of some arbitrary value, for instance, 0.05, may correspond to an ideal confidence estimator for one classifier (of much higher risk) and to a completely random confidence estimator for another classifier (of risk equal to 0.05). The excess AURC (E-AURC) was proposed by \citet{geifman_bias-reduced_2019} to alleviate this problem: for a given classifier $h$ and confidence estimator $g$, it is defined as $\text{E-AURC}(h, g) = \text{AURC}(h, g) - \text{AURC}(h, g^*)$, where $g^*$ corresponds to a hypothetically optimal confidence estimator that perfectly orders samples in decreasing order of their losses. Thus, an ideal confidence estimator always has zero E-AURC.

Unfortunately, E-AURC is still highly sensitive to the classifier's risk, as shown by \citet{galil_what_2023}, who suggested the use of AUROC instead. However, using AUROC for comparing confidence estimators has an intrinsic disadvantage: if we use AUROC to evaluate the performance of a tunable confidence estimator, it makes sense to optimize the estimator using this same metric. Since AUROC and AURC are not necessarily monotonically aligned \citep{ding_revisiting_2020}, the resulting confidence estimator will then be optimized for a different problem than the one in which we were originally interested (which is selective classification). Ideally, we would like to evaluate confidence estimators using a metric that is a monotonic function of AURC.

We propose a simple modification to E-AURC that eliminates the shortcomings pointed out by \citet{galil_what_2023}: normalizing by the E-AURC of a random confidence estimator, whose AURC is equal to the classifier's risk. More precisely, we define the normalized AURC (NAURC) as
\begin{equation}
\text{NAURC}(h, g) = \frac{\text{AURC}(h, g) - \text{AURC}(h, g^*)}{R(h) - \text{AURC}(h, g^*)}.
\end{equation}
Note that this corresponds to a min-max scaling that maps the AURC of the ideal confidence estimator to 0 and the AURC of the random confidence estimator to 1.
The resulting NAURC is suitable for comparison across different classifiers and is monotonically related to AURC.
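To make the monotonic relation explicit, note that, for a fixed classifier $h$ with $R(h) > \text{AURC}(h, g^*)$, NAURC is an increasing affine function of AURC:
\begin{equation*}
\text{NAURC}(h, g) = a \, \text{AURC}(h, g) - a \, \text{AURC}(h, g^*), \qquad a = \frac{1}{R(h) - \text{AURC}(h, g^*)} > 0.
\end{equation*}
As a purely hypothetical example, a classifier with $R(h) = 0.20$ and $\text{AURC}(h, g^*) = 0.02$ whose confidence estimator attains $\text{AURC}(h, g) = 0.05$ has $\text{NAURC}(h, g) = 0.03/0.18 \approx 0.17$, whereas for a classifier with $R(h) = 0.05$ the same AURC of $0.05$ matches that of the random confidence estimator and yields $\text{NAURC}(h, g) = 1$.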

\subsubsection{MSP Fallback}
\label{section:fallback}

A useful property of MSP-TS-AURC (but not MSP-TS-NLL) is that, in the infinite-sample setting, it can never perform worse than the MSP baseline, as long as $T=1$ is included in the search space. It is natural to extend this property to every confidence estimator, for a simple reason: it is very easy to check whether the estimator provides an improvement over the MSP baseline and, if not, to use the MSP instead. Formally, this corresponds to adding a binary hyperparameter indicating an MSP fallback.
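
As a sketch, assume the check is carried out on the same hold-out data used for tuning, and let $\widehat{\text{AURC}}$ denote the AURC estimated on that data; both this notation and the symbol $g_{\text{fb}}$ are introduced here only for illustration. The estimator with fallback can then be written as
\begin{equation*}
g_{\text{fb}} =
\begin{cases}
g, & \text{if } \widehat{\text{AURC}}(h, g) < \widehat{\text{AURC}}(h, \text{MSP}), \\
\text{MSP}, & \text{otherwise},
\end{cases}
\end{equation*}
so that $\widehat{\text{AURC}}(h, g_{\text{fb}}) = \min\{\widehat{\text{AURC}}(h, g), \widehat{\text{AURC}}(h, \text{MSP})\} \le \widehat{\text{AURC}}(h, \text{MSP})$ by construction.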

Equivalently, when measuring performance across different models, we simply report a (non-negligible) positive gain in NAURC whenever it occurs. More precisely, we define the \textit{average positive gain} (APG) in NAURC as
\begin{equation}
\text{APG}(g) = \frac{1}{|\calH|}
%|\calH|^{-1}
\sum_{h \in \calH} \left[ \text{NAURC}(h, \text{MSP}) - \text{NAURC}(h, g) \right]^+_\epsilon 
\end{equation}
%where
%\begin{equation*}
%[x]^+_\epsilon = 
%\begin{cases}
%x, & \text{if $x > \epsilon$} \\
%0, & \text{otherwise}
%\end{cases}
%\end{equation*}
where $[x]^+_\epsilon$ is defined as $x$ if $x > \epsilon$ and as $0$ otherwise, $\calH$ is a set of classifiers, and $\epsilon > 0$ is chosen so that only non-negligible gains are reported.
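
As a purely hypothetical illustration, suppose $\calH$ contains three classifiers for which the differences $\text{NAURC}(h, \text{MSP}) - \text{NAURC}(h, g)$ are $0.040$, $0.005$, and $-0.020$, and take $\epsilon = 0.01$. Only the first difference exceeds $\epsilon$, so
\begin{equation*}
\text{APG}(g) = \frac{1}{3}\left(0.040 + 0 + 0\right) \approx 0.013,
\end{equation*}
whereas the plain average of the three differences, $(0.040 + 0.005 - 0.020)/3 \approx 0.008$, would also count the negligible gain and the loss that the MSP fallback avoids.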
