%\documentclass{uai2024} % for initial submission
\documentclass[accepted]{uai2024} 
% after acceptance, for a revised version; 
% also before submission to see how the non-anonymous paper would look like 
                        
%% There is a class option to choose the math font
% \documentclass[mathfont=ptmx]{uai2024} % ptmx math instead of Computer
                                         % Modern (has noticeable issues)
% \documentclass[mathfont=newtx]{uai2024} % newtx fonts (improves upon
                                          % ptmx; less tested, no support)
% NOTE: Only keep *one* line above as appropriate, as it will be replaced
%       automatically for papers to be published. Do not make any other
%       change above this note for an accepted version.

%% Choose your variant of English; be consistent
\usepackage[american]{babel}
% \usepackage[british]{babel}

%% Some suggested packages, as needed:
\usepackage{natbib} % has a nice set of citation styles and commands
    \bibliographystyle{plainnat}
    \renewcommand{\bibsection}{\subsubsection*{References}}
\usepackage{mathtools} % amsmath with fixes and additions
% \usepackage{siunitx} % for proper typesetting of numbers and units
\usepackage{booktabs} % commands to create good-looking tables
\usepackage{tikz} % nice language for creating drawings and diagrams
\usepackage{pifont}
\usepackage{amsthm}

%My packages
\usepackage{algorithm}
\usepackage{algpseudocode}
\usepackage{graphicx,subfigure}
\usepackage{multirow}

%% Provided macros
% \smaller: Because the class footnote size is essentially LaTeX's \small,
%           redefining \footnotesize, we provide the original \footnotesize
%           using this macro.
%           (Use only sparingly, e.g., in drawings, as it is quite small.)

%% Self-defined macros
\newcommand{\swap}[3][-]{#3#1#2} % just an example
\newtheorem{theorem}{Theorem}
\newtheorem{lemma}{Lemma}

\title{
%Trust Uncertainty: Navigating Overconfidence in Out-of-Distribution Detection
%Trust Uncertainty: Capturing Extreme Activations to Mitigate Overconfidence in Out-of-Distribution Detection
Mitigating Overconfidence in Out-of-Distribution Detection by Capturing Extreme Activations
}

% The standard author block has changed for UAI 2024 to provide
% more space for long author lists and allow for complex affiliations
%
% All author information is automatically removed by the class for the
% anonymous submission version of your paper, so you can already add your
% information below.
%
% Add authors
\author[1]{\href{mailto:m.azizmalayeri@amsterdamumc.nl
%?Subject=UAI 2024 paper
}{Mohammad Azizmalayeri}{}}
\author[1]{\href{mailto:a.abu-hanna@amsterdamumc.nl}{Ameen Abu-Hanna}}
\author[1,2,3]{\href{mailto:g.cina@amsterdamumc.nl}{Giovanni Cin{\`a}}}
%\author[3]{Further~Coauthor}
%\author[1]{Further~Coauthor}
%\author[3]{Further~Coauthor}
%\author[3,1]{Further~Coauthor}
% Add affiliations after the authors
\affil[1]{%
        Department of Medical Informatics, Amsterdam Public Health Research Institute, Amsterdam UMC, University of Amsterdam, The Netherlands
}
\affil[2]{%
Institute of Logic, Language and Computation, University of Amsterdam, The Netherlands
}
\affil[3]{%
Pacmed, Amsterdam, The Netherlands
  }
  
  \begin{document}
\maketitle

\begin{abstract}
Detecting out-of-distribution (OOD) instances is crucial for the reliable deployment of machine learning models in real-world scenarios.
OOD inputs are commonly expected to cause a more uncertain prediction in the primary task; however, there are OOD cases for which the model returns a highly confident prediction. 
%[[JUST A NOTE: IN STARTISTICS WE DIFFERENATIATE BETWEEN AN ESTIMATE, LIKE A PROBABILITY, AND THE CERTAINTY ABOUT IT MEASURED FOR EXAMPLE AS A CONFIDENCE INTERVAL. FOR EXAMPLE I CAN PREDICT A PROBABILITY OF 0.5 BUT WITH VERY HIGH CONFDENCE, FOR EXAMPLE BECAUSE THE CONF INT IS (0.499, 0.5001). AND I CAN HAVE A PREDICTION OF 0.9 BUT WITH CONF INT OF (0.35, 1.0). IN YOUR PAPER YOU ARE USING THE PROBABILITY ITSELF TO CONVEY CONFIDENCE. IT IS OK JUST BE AWARE OF THE TERMINOLOGY.]]
This phenomenon, denoted as ``overconfidence'', presents a challenge to OOD detection. 
Specifically, theoretical evidence indicates that overconfidence is an intrinsic property of certain neural network architectures, leading to poor OOD detection.
In this work, we address this issue by measuring extreme activation values in the penultimate layer of neural networks and then leveraging this proxy of overconfidence to improve on several OOD detection baselines. 
% The original score is therefore assigned to normal OOD instances, but the score is modified when overconfidence occurs.  
%We conduct extensive experiments on tabular datasets demonstrating the effectiveness of this method across various types of OOD detection methods. 
%We initially build our experiments on the theoretical finding that demonstrated OOD detection is vulnerable to overconfidence in a linear model with synthesized OOD. 
We evaluate our method in a wide array of experiments spanning synthetic and real-world data, tabular and image datasets, multiple architectures such as ResNet and Transformer, and different training loss functions, including the scenarios examined in previous theoretical work.
%Next, we expand the experiments to the OOD set selected from a real dataset and architectures like tabular ResNet and Transformer. 
%In addition, we also assess the performance within image datasets. 
%On datasets such as eICU and MIMIC-IV, on average, we could improve the OOD detection performance of a baseline like EBO by $42.1\%$ on synthesized OODs and $45.4\%$ on the real OOD set. 
Compared to the baselines, our method often grants substantial improvements, with double-digit increases in OOD detection AUC, and it does not damage performance in any scenario.

\end{abstract}

%Keywords -> For submission
%OOD Detection, Overconfidence, Data Shift

%TL;DR "Too Long; Didn't Read": a short sentence describing your paper -> For submission

%In order to solve the issue of overconfidence in OOD detection, we propose a new solution involving the capture of extreme activations in the neural network used for post-hoc OOD detection. 

\begin{figure*}[t]
\centering
\includegraphics[width=\textwidth]{images/fig1_v4.png}
%\vskip -5pt
\caption{Visual representation of the proposed method. We measure the $\ell_2$-norm of extreme activation values larger than the threshold $\tau$ (CEA) as an indicator of overconfidence caused by OOD samples and add it to the original novelty score computed based on the probabilities and activation values to generate the final novelty scores.}
\label{fig:fig1}
\end{figure*}

\section{Introduction}\label{sec:intro}
Post-deployment, neural networks may encounter out-of-distribution (OOD) samples coming from a distribution different from that of the training set. This may be due to reasons such as data shift, variations in data collection protocol, and input noise, among others \citep{bandi2018detection, koh2021wilds, zadorozhny2022out}. The predictions on such samples can be unreliable, which raises major concerns for deployment in high-stakes applications. A solution to this problem is OOD detection, which aims to identify OOD inputs in real time before serving any prediction \citep{yang2021generalized, zimmerer2022mood}.

A common assumption in OOD detection is that machine learning (ML) models are more uncertain about OOD inputs compared to in-distribution (ID) data. This rationale underpins different metrics to identify OOD inputs such as maximum softmax probability (MSP) or entropy \citep{hendrycks2017a}, which in turn relate to different ways in which one can measure uncertainty. However, for several metrics of uncertainty, it has been demonstrated that ML models can return overconfident predictions on some kinds of OOD inputs, e.g., abnormally high softmax confidences \citep{nguyen2015deep}. This can drastically reduce OOD detection performance.

This phenomenon has been theoretically investigated for feed-forward models with the ReLU activation function by \cite{hein2019relu,ulmer2021know}. For OOD instances generated from ID data by scaling a single variable, they prove that the output probability vector can converge to a one-hot vector. As a result, a detector relying on uncertainty measures like MSP or entropy would confidently misclassify such OOD cases as ID.

In this work, we address the overconfidence problem in OOD detection methods. For this purpose, we propose to adjust the novelty score (the score assigned to each input to classify it as OOD or ID) by adding a second term responsible for capturing overconfidence.
Inspired by the observation that OOD inputs cause outsized activation values in neural networks \citep{sun2021react}, we suggest 
defining this term by capturing extreme activation values (CEA). In practice, this involves taking the $\ell_2$-norm of the extreme activation values (those surpassing a specified threshold) at the penultimate layer of the model. If the threshold is chosen appropriately over a validation set, ID activations remain below the cutoff, and the term only captures overconfidence caused by OOD samples. 
%[[THIS WAS AN IMPORTANT SENTENCE. IS IT A FACT? HYPOTHESIS? I THINK I CAN FIND CORNER EXAMPLES IN WHICH THE ID PROBABILITY IS CLOSE TO 1]]. 
Our method is illustrated in Fig.~\ref{fig:fig1}. 
%[[BY THE WAY I THINK IN FUTURE WORK ONE COULD TRY TOI ADAPT THE THRESHOLD BASED ON THE INSTANCE, FOR EXAMPLE ITS DISTANCE FROM OTHER DATA POINTS. NOW IT IS "ONE SIZE FITS ALL"]].

To assess the effectiveness of the proposed method, we adopt the experimental settings of \cite{ulmer2021know}, who define the setting described above, with piece-wise linear models and synthetic OOD data, in which the impact of overconfidence on OOD detection becomes evident.
%This study theoretically proves that OOD detection fails due to overconfidence in linear feed-forward models with the ReLU activation function for OOD instances generated by scaling a single variable from ID data.
In addition,
%[[WHAT WAS THE ARCHITECTURE IN THIS MAIN SETTING?]]
we experiment with alternative architectures, such as tabular ResNet and Transformer, and with real-world OOD sets, where overconfidence may not occur. 
Our experiments also include models trained with LogitNorm \citep{wei2022mitigating}, a custom loss mitigating overconfidence. Furthermore, we conduct experiments with image data to evaluate how the results extend to other modalities.

Results demonstrate that many baseline OOD detection methods benefit from CEA, which substantially improves their detection performance across different settings. For example, results averaged over 5 different tabular datasets indicate a $45.6\%$ improvement in the AUC of MSP for detecting synthesized OODs. Moreover, MSP performance increases by $41.1\%$ on average in the experiments with a real-world tabular OOD set. Our findings also shed light on the different factors influencing overconfidence, such as network architecture and ID data heterogeneity.
%In summary, we address the overconfidence issue in OOD detection by capturing the extreme activation values at the penultimate layer of the neural networks. 
%To sum up, many baseline OOD detection methods can benefit from this approach as it 
The advantage of our proposal is that it can be incorporated into any method without requiring any change to the original technique or adding much computational overhead. Consequently, this method can potentially be applied to a variety of settings, improving the reliability of OOD detection methods across the board. All experiments are fully reproducible and the code is provided open access\footnote{\url{https://github.com/mazizmalayeri/CEA}.}.

\section{Preliminaries and Related Work}

\textbf{OOD detection:} For identifying OOD instances, we require a function $f$ assigning larger novelty scores to OOD inputs compared to ID ones and a threshold $\beta$ such that:
\begin{equation}\label{Eq:OOD_Detection}
    G(x, f, \beta) = \begin{cases}
      \text{OOD} & \text{if } f(x)\geq\beta,\\
      \text{ID} & \text{if } f(x)<\beta
    \end{cases}.
\end{equation}
Common choices for $f$ include methods that train a new model to estimate the distribution of ID data, such as approaches based on an auto-encoder \citep{zhou2022rethinking}, and post-hoc detection methods, which are elaborated on below. In this study, the latter are of particular interest since their novelty scores are directly affected by overconfidence.

\textbf{Post-hoc OOD detection:} 
Assuming that a model is trained for a certain task (which could be anything, from sentiment classification to mortality prediction), post-hoc methods can be employed to identify OOD inputs without retraining a new model, which makes them an appealing choice in many applications \citep{yang2021generalized}.
The novelty score is often generated based on the class probabilities or the internal representations of the pre-trained neural network. For example, EBO \citep{liu2020energy} utilizes an energy score instead of a softmax score since it aligns with the probability density of inputs and suffers less from overconfidence, while MDS \citep{lee2018simple} measures the distance of each input from class-conditional Gaussian distributions in the feature space.
More examples can be found in Appendix~\ref{apd:baselines}.

\textbf{Overconfidence in OOD detection:}
For the remainder of this paper, we take the maximum softmax probability of the predicted classes as the measure of confidence of the model (we employ certainty and confidence as synonyms).
% The overconfidence issue generally refers to the phenomenon of having a high level of confidence (probability) in the class predicted by a model for its primary task for an OOD instance. 
By \textit{overconfidence} we refer to the phenomenon in which the level of confidence in the predicted class increases as we move away from ID data in the feature space. This in turn causes the OOD detection AUC to decrease as we move away from ID data, contrary to the desired behavior.
In addition to the softmax scores, this phenomenon can extend to the intermediate layers as demonstrated in \cite{sun2021react}: activation values in the internal layers of a neural network have a different pattern for OOD data. This can reduce the performance of post-hoc detection methods. Among the methods provided for OOD detection, ReAct \citep{sun2021react} and LogitNorm \citep{wei2022mitigating} are specifically designed to mitigate this problem. 

ReAct addresses this issue by capping the activation values in the intermediate layers of neural networks at an upper limit, so that the overconfident values do not affect the final prediction and novelty score. Despite the advantages of this method, we argue that it may lose useful information. Accordingly, in contrast to ReAct, we suggest retaining those values when generating the novelty score, but adjusting the novelty score based on the $\ell_2$-norm of extreme activation values (CEA).

LogitNorm is not an OOD detection method, but a loss function designed to alleviate the overconfidence of neural networks. For this purpose, motivated by the insight that increased logit norm during training with softmax cross-entropy loss induces overconfidence, the authors train the model by enforcing a constant norm on the logit vector. Note that this method requires training a new model with a constrained loss that may impact the optimization of the model for the original task.

\textbf{Theoretical vulnerability of OOD detection:} The study by \citet{ulmer2021know} gives a theoretical explanation of why OOD detection fails under overconfidence. To achieve this, they utilize a known result that feed-forward neural networks with piece-wise linear activation functions partition the input space into polytopes \citep{arora2018understanding}. They then use the fact that these networks are component-wise strictly monotonic on each of their polytopes \citep{croce2019randomized, hein2019relu}. They also prove that if we scale a single variable in an input with a factor $\alpha$, there exists a value $\delta$ such that $\forall\ \alpha >\delta$, the output always lies in a specific polytope.

Given that we stay within a single polytope for any sufficiently large $\alpha$ and that the network is monotonic on each polytope, they prove under certain conditions that if we scale a variable from the input by $\alpha\rightarrow\infty$, the softmax output of the defined neural network converges to a one-hot vector. Using this finding, in their Theorem 1, they conclude that OOD detection fails if we measure uncertainty with metrics like MSP and entropy, as these methods assign smaller novelty scores to such OOD instances compared to ID ones. Furthermore, it has been empirically observed that this phenomenon also occurs at finite values of $\alpha$ \citep{azizmalayeri2023unmasking}. 
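
As a minimal illustration of this limit (a sketch under simplifying assumptions, not the general argument of \cite{ulmer2021know}), assume that within the final polytope the logit of each class $c$ is an affine function of the scaling factor, $z_c(\alpha) = a_c\,\alpha + b_c$, and that a single class $c^\ast$ has the strictly largest slope. The slopes $a_c$, offsets $b_c$, and the index $c^\ast$ are hypothetical symbols introduced only for this example. Then
\begin{equation*}
    \sigma(z(\alpha))_{c^\ast} = \frac{1}{1 + \sum_{c \neq c^\ast} \exp\bigl((a_c - a_{c^\ast})\,\alpha + (b_c - b_{c^\ast})\bigr)} \xrightarrow[\alpha \to \infty]{} 1,
\end{equation*}
because $a_c - a_{c^\ast} < 0$ makes every exponent tend to $-\infty$. In this simplified setting, the maximum softmax probability approaches one and the entropy approaches zero, no matter how far the scaled input lies from the ID data.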

\section{Method}\label{sec:method}
%Despite the growing success of post-hoc OOD detection methods, they may perform even worse than a random binary classifier due to having a higher level of confidence in OOD instances. 
In this section, we introduce CEA, a method that addresses the problem of overconfidence by modifying the novelty scores based on the outsized activation values. For this purpose, we describe below how the confidence level can be integrated into the OOD detection setup. 
%Next, we will continue by providing a simple criterion to measure the confidence level.
%Lastly, we define the experiment setup in which we investigate the effectiveness of the method.

\subsection{Considering Overconfidence in Novelty Score}
%Prediction models are generally expected to be more uncertain about OOD inputs compared to ID data, which is a basic idea in OOD detection. However, it has been shown that ML models can have an overconfident prediction on some kinds of OOD inputs, i.e., abnormally high softmax confidences \citep{nguyen2015deep}. This can drastically reduce the performance of post-hoc OOD detection methods.

In order to deal with overconfidence in OOD detection, we suggest directly taking it into account as part of the novelty score generation. 
% A simple solution may be to employ the novelty score only when overconfidence is not triggered. 
%As this issue is not consistent across all OOD inputs, it is not feasible to condition the novelty score on the overconfidence or to directly use it for identifying the OOD inputs. 
%Instead, we propose adding a new term to the novelty score which triggers only when overconfidence occurs. In this manner, when an OOD input causes overly confident prediction, the new term would be activated; otherwise, the original novelty score would be retained.
We propose adding a new term to the novelty score that becomes large only when an OOD input causes an overly confident prediction; otherwise, the new term remains close to zero and the original novelty score is essentially retained.
In summary, we propose to change Eq.~\ref{Eq:OOD_Detection} to:
\begin{equation}
    G(x, f, g, \beta) = \begin{cases}
      \text{OOD} & \text{if } f(x)+\lambda\, g(x)\geq\beta,\\
      \text{ID} & \text{if } f(x)+\lambda\, g(x)<\beta
    \end{cases},
\end{equation}
where $\beta$ is a threshold for classifying a sample as OOD, $f(x)$ returns the novelty score,  $g(x)$ is the new term responsible for indicating overconfidence, and $\lambda$ controls the tradeoff between $f$ and $g$. The function $g(x)$ should have the following characteristics:
\begin{itemize}[itemsep=0pt,parsep=0pt, topsep=0pt, itemindent=0pt, leftmargin=10pt]
\item The value returned by $g(x)$ for ID data should be smaller than or equal to the value returned by $g(x)$ for OOD data.
%\item The value returned by $g(x)$ for OOD data leading to overconfident predictions should be larger or equal to the value returned by $g(x)$ for other OOD cases.
\item The value returned by $g(x)$ should monotonically increase as the overconfidence level rises, e.g., when amplifying the scaling factor $\alpha$ for synthesizing the OOD instances.
\end{itemize}
The first condition guarantees that the addition of $g(x)$ will not adversely impact the performance of the original novelty score $f(x)$ for OOD detection, and the second one is directed towards the primary objective of introducing $g(x)$, namely highlighting the existence of overconfidence.
Note that $g(x)$ alone may not be sufficient for OOD detection as OOD instances without overconfidence will not be spotted. One may also think of using $g(x)$ as a trigger for $f(x)$, meaning that the latter is used only when the former does not trigger. This approach however requires an additional hyperparameter to decide when $g$ would raise a flag.
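
To make the effect of the added term concrete, consider a small numerical sketch in which all values are hypothetical and chosen only for illustration: a threshold $\beta = 0.5$, a coefficient $\lambda = 0.2$, an overconfident OOD input $x_{o}$ with $f(x_{o}) = 0.1$ and $g(x_{o}) = 4$, and an ID input $x_{i}$ with $f(x_{i}) = 0.2$ and $g(x_{i}) = 0$. Then
\begin{align*}
    f(x_{o}) + \lambda\, g(x_{o}) &= 0.1 + 0.2 \cdot 4 = 0.9 \geq \beta, \\
    f(x_{i}) + \lambda\, g(x_{i}) &= 0.2 + 0.2 \cdot 0 = 0.2 < \beta .
\end{align*}
With $f$ alone, $x_{o}$ would receive a lower novelty score than $x_{i}$ and be missed at this threshold, whereas the combined score flags $x_{o}$ as OOD and leaves the score of $x_{i}$ unchanged.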


\subsection{Overconfidence Measure}\label{sec:overconfidence_measure}

In this section, we present a choice for $g(x)$ that meets the specified conditions and can be applied to any architecture. It has been observed that OOD data can lead to activation patterns in neural networks that are significantly different from ID data, i.e., activation units with extremely large values, which results in overconfident predictions \citep{sun2021react}. We demonstrate in the subsequent theorem that one kind of overconfident behavior on the side of the model entails the presence of extreme activations in the penultimate layer.
%\textbf{Theorem 1.}
\begin{theorem}
Let $x\in R^D$ and suppose $\alpha$ is a scaling vector.  Now $x'= \alpha \odot x$ can be considered as an OOD example if $\alpha$ is large enough. Let $h_\theta$ be any neural network whose last layer is linear, generating an overconfident prediction for class $c$ on $x'$ as:
\begin{equation}
    \lim_{{\alpha_d \to \infty}} \sigma(h_\theta (x'))_c = 1,
\end{equation}
where $\sigma$ is the softmax function and $\alpha_d$ denotes the $d$-th entry of $\alpha$. Then there exists at least one dimension in which the output of the penultimate layer goes to infinity in the limit.
\end{theorem}
%\textit{Proof.} 
\begin{proof}
The proof is available in Appendix \ref{apd:proof}.
\end{proof}

This finding suggests that extreme activations in the penultimate layer can be an indicator of overconfidence in the prediction. Hence, we can measure the magnitude of extreme activations, denoted as CEA, as a proxy for $g(x)$. 
To this end, we use the $\ell_2$ norm of node activation values at the penultimate layer of the neural network that are higher than a specified threshold. Accordingly, assuming that $k_\theta(x)$ is the activation vector before the classification layer generated by the prediction model $k$ with parameters $\theta$ for the input $x$, we define CEA as:
\begin{equation}
   \mathrm{CEA}(x, k_\theta, \tau) = {\lVert \max(k_\theta(x)-\tau, 0) \rVert}_2\ ,
\end{equation}
where $\tau$ is the specified threshold and the maximum is taken element-wise. We utilize the $\ell_2$-norm in our method, but it can potentially be substituted with other norms as well. The pseudocode for computing CEA as a proxy of $g(x)$ and adding it to the original novelty score is provided in Algorithm~\ref{alg:g(x)}. 
This selection for $g(x)$ intuitively yields larger values for the overconfident OOD inputs than other OODs as they lead to more outsized activation nodes. Also, 
with a suitable choice of $\tau$, the values returned for ID data would be comparatively smaller.
Therefore, by appropriately selecting hyperparameters, the CEA algorithm can fulfill the specified conditions. 
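
As a small worked example (the activation vectors and threshold below are hypothetical values chosen for illustration), take $\tau = 3$ and a penultimate-layer activation vector $k_\theta(x) = (0.5,\ 3.2,\ 7.0)$. Then
\begin{equation*}
    \max(k_\theta(x) - \tau, 0) = (0,\ 0.2,\ 4.0), \qquad
    {\lVert (0,\ 0.2,\ 4.0) \rVert}_2 = \sqrt{0.04 + 16} \approx 4.005 .
\end{equation*}
By contrast, a vector such as $(0.5,\ 1.2,\ 2.9)$, whose entries all lie below $\tau$, yields a value of exactly $0$, so the novelty score of such an input would be left unchanged.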
%In the following, we discuss the choice for the hyper-parameters in the algorithm.

\begin{algorithm}[t]
\caption{Pseudocode for computing the CEA-adjusted novelty score.}\label{alg:g(x)}
\begin{algorithmic}
\State \textbf{Input:} Prediction model $k_\theta$, sample $x$, OOD detection method $f$.
\State \textbf{Parameters:} Coefficient $\lambda$, Threshold $\tau$.
\vspace{0.5\baselineskip}
\State $x_{activations}, x_{logits} = k_\theta(x)$ 
\begin{scriptsize}\Comment{Penultimate-layer activations and logits.}\end{scriptsize}
\State $NS= f(x_{activations}, x_{logits})$ 
\begin{scriptsize}\Comment{Original novelty score ($f(x)$).}\end{scriptsize}
\State $CEA= \max(x_{activations}-\tau, 0)$ 
\begin{scriptsize}\Comment{Capturing extreme activations.}\end{scriptsize}
\State $CEA={\lVert CEA\rVert}_2$
\begin{scriptsize}\Comment{$\ell_2$-norm of extreme values ($g(x)$).}\end{scriptsize}
\State $NS = NS + \lambda\ CEA$
\begin{scriptsize}\Comment{Modifying $NS$ based on $CEA$.}\end{scriptsize}
\vspace{0.5\baselineskip}
\State \textbf{Output:} Modified novelty score $NS$.
\end{algorithmic}
\end{algorithm}

\textbf{Hyper-parameter selection:} $\tau$ and $\lambda$ play an important role in the proposed method. We select the threshold $\tau$ such that it remains above the feature values of ID data. Hence, $g(x)$ is close to zero for ID data, while it can capture outsized feature values in OOD instances. To tune these values, we use a validation set of ID data, $\mathcal{D}_{val}$, and extract the activation values of its samples at the penultimate layer of the prediction model. $\tau$ could be set to the maximum activation value extracted from the validation data; however, in the presence of outliers, such a choice might lead to a very large $\tau$ that prevents $g(x)$ from capturing the overconfidence even in OOD cases. Instead, we use the activation value at the $p$-th percentile to avoid noisy activation values. 
%Notably, $p=100$ is equivalent to utilizing the largest value. 
In our study, we set $p=99.9$ for tabular datasets and $p=99.999$ for images. We further scale this value by a factor of $\rho=1.1$ to ensure that most ID feature values remain below the threshold.
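
This selection rule can be written compactly as follows. The sketch assumes that the percentile is computed over the activation values of all penultimate-layer units and all validation samples pooled together, which is one possible reading of the description above. The symbols $Q_p$ (the empirical $p$-th percentile) and $m$ (the number of penultimate-layer units) are introduced here for illustration:
\begin{equation*}
    \tau = \rho \cdot Q_p\bigl(\{\, k_\theta(x)_j : x \in \mathcal{D}_{val},\ j = 1, \dots, m \,\}\bigr).
\end{equation*}
For instance, if the $99.9$-th percentile of the pooled validation activations were $5.0$ (a hypothetical value), then $\rho = 1.1$ would give $\tau = 5.5$.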

The coefficient $\lambda$ is determined based on the average $f(x)$ and $g(x)$ over  $\mathcal{D}_{val}$. More specifically, $\lambda$ is computed as:
\begin{equation}
    \lambda = \gamma\ \left\lvert\frac{\sum_{x\in \mathcal{D}_{val}} f(x)}{\sum_{x\in \mathcal{D}_{val}} g(x)}\right\rvert\ ,
\end{equation}
where $\gamma$ lets us control the tradeoff between $f(x)$ and $g(x)$. In our study, we set $\gamma=1$. We conduct an ablation study on the parameters $p$ and $\gamma$ in the experiments.
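
One way to read this choice of $\lambda$, assuming that the denominator is non-zero (i.e., at least one validation activation exceeds $\tau$), is that it balances the average magnitudes of the two terms on the validation set. Since $g(x) \geq 0$ and $\lambda \geq 0$, multiplying both sides of the definition by the average of $g$ gives
\begin{equation*}
    \frac{1}{\lvert \mathcal{D}_{val} \rvert} \sum_{x\in \mathcal{D}_{val}} \lambda\, g(x)
    = \gamma\ \left\lvert \frac{1}{\lvert \mathcal{D}_{val} \rvert} \sum_{x\in \mathcal{D}_{val}} f(x) \right\rvert ,
\end{equation*}
so with $\gamma = 1$ the average of $\lambda\, g(x)$ matches the absolute value of the average of $f(x)$. As a hypothetical numerical example, $\sum_{x} f(x) = 120$ and $\sum_{x} g(x) = 0.6$ would give $\lambda = 200$.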

It should be noted that we only use a validation set from ID data to find the hyper-parameters. Still, $\tau$ and $\lambda$ could be tuned further by assessing the OOD detection performance across different ranges of these parameters. However, this requires a diverse set of OOD data covering the different kinds of OOD examples that the model may face in practice, which is not available in many cases. %Thus, we opt to limit our parameter tuning to the validation set from ID data. Nonetheless, if access to such an OOD dataset is available, it could aid in identifying optimal parameters.

\begin{figure*}[ht]
\centering
\begin{subfigure}{}
\centering
     \includegraphics[width=\linewidth]{images/main_eicu.png}
\end{subfigure}
\\
\begin{subfigure}{}
\centering
\vskip -10pt
         \includegraphics[width=\linewidth]{images/main_diabetics.png}
\end{subfigure}
\vskip -15pt
\caption{OOD detection performance with and without CEA using the eICU (top) or Diabetics (bottom) datasets as ID  and synthesized OOD data obtained by scaling. The blue bars are positioned in front of the red ones and cross markers are employed to emphasize the top of the red bars. The scaling factors $\alpha$ and baselines are presented under each bar.}
    \label{fig:tabular}
\end{figure*}

\begin{figure*}[ht]
    \centering
        \includegraphics[width=\textwidth]{images/mimic_vs_eicu.png}
        \caption{OOD detection performance with and without CEA using MIMIC-IV as ID and eICU as OOD (left) and the other way around (right). The blue bars are positioned in front of the red ones and cross markers are employed to emphasize the top of the red bars.}
    \label{fig:mimic_vs_eicu}
\end{figure*}



% Please add the following required packages to your document preamble:
% \usepackage{multirow}
\begin{table*}[ht]
\centering
\setlength{\tabcolsep}{7.3pt}
\caption{
AUC of OOD detection with and without CEA using tabular ResNet and Transformer as the prediction model. We use eICU and Diabetics as ID and synthesize the OOD data by scaling with factor $\alpha$. Superior results are emphasized in bold unless the two are equal.}
\begin{tabular}{ccccc|ccc}
\toprule
\multicolumn{1}{l}{}                            &                            & \multicolumn{3}{c|}{ResNet}                                                  & \multicolumn{3}{c}{Transformer}                                             \\ \cmidrule{3-8} 
\multicolumn{1}{l}{}                            & \multicolumn{1}{c}{}       & \multicolumn{1}{c}{$\alpha=10$} & \multicolumn{1}{c}{$\alpha=100$} & \multicolumn{1}{c|}{$\alpha=1000$} & \multicolumn{1}{c}{$\alpha=10$} & \multicolumn{1}{c}{$\alpha=100$} & \multicolumn{1}{c}{$\alpha=1000$} \\ \cmidrule{3-8} 
ID                     & Method                     & \multicolumn{6}{c}{Baseline / Baseline\hskip0.75pt\&\hskip0.75pt CEA}                                                                                                                          \\ \midrule
\multicolumn{1}{l|}{\multirow{7}{*}{eICU}}      & \multicolumn{1}{l|}{MDS}   & \textbf{77.8} / 77.6              & 91.7 / \textbf{91.8}               & 93.6 / 93.6                 & 63.3 / 63.3              & 83.6 / 83.6               & 90.7 / 90.7                \\
\multicolumn{1}{c|}{}                           & \multicolumn{1}{l|}{KNN}   & 72.0 / \textbf{74.7}              & 86.8 / \textbf{90.5}               & 89.5 / \textbf{93.3}                 & 59.9 / 59.9              & 79.5 / 79.5               & 90.1 / 90.1                \\
\multicolumn{1}{c|}{}                           & \multicolumn{1}{l|}{ViM}   & 75.4 / 75.4              & 91.4 / 91.4               & 93.7 / 93.7                 & 60.0 / 60.0              & 80.5 / 80.5               & 90.3 / 90.3                \\
\multicolumn{1}{c|}{}                           & \multicolumn{1}{l|}{MSP}   & 47.9 / \textbf{69.5}              & 30.6 / \textbf{87.3}               & 13.2 / \textbf{93.6}                 & 51.7 / \textbf{52.5}              & 56.1 / \textbf{58.3}               & 71.7 / \textbf{73.5}                \\
\multicolumn{1}{c|}{}                           & \multicolumn{1}{l|}{EBO}   & 46.4 / \textbf{69.2}              & 28.6 / \textbf{87.1}               & 13.2 / \textbf{93.6}                 & 51.6 / \textbf{52.3}              & 56.1 / \textbf{57.9}               & 71.4 / \textbf{73.0}                \\
\multicolumn{1}{c|}{}                           & \multicolumn{1}{l|}{ReAct} & 61.6 / \textbf{70.3}              & 71.8 / \textbf{88.0}               & 76.1 / \textbf{93.6}                 & 51.9 / \textbf{52.5}              & 56.6 / \textbf{58.3}               & 72.0 / \textbf{73.7}                \\
\multicolumn{1}{c|}{}                           & \multicolumn{1}{l|}{Gram}  & 35.4 / \textbf{55.0}              & 16.0 / \textbf{44.8}               & 6.8 / \textbf{24.0}                  & 50.8 / \textbf{51.7}              & 54.4 / \textbf{57.0}               & 68.3 / \textbf{69.9}                \\ \midrule
\multicolumn{1}{l|}{\multirow{7}{*}{Diabetics}} & \multicolumn{1}{l|}{MDS}   & \textbf{85.9} / 85.8              & 90.2 / \textbf{90.3}               & 91.8 / 91.8                 & 84.3 / 84.3              & 89.5 / 89.5               & 91.3 / 91.3                \\
\multicolumn{1}{c|}{}                           & \multicolumn{1}{l|}{KNN}   & 80.0 / \textbf{83.9}              & 85.0 / \textbf{89.7}               & 87.0 / \textbf{91.9}                & 80.3 / 80.3              & 88.3 / 88.3               & 90.8 / 90.8                \\
\multicolumn{1}{c|}{}                           & \multicolumn{1}{l|}{ViM}   & 79.2 / \textbf{84.9}              & 83.9 / \textbf{90.8}               & 85.5 / \textbf{92.5}                 & 84.2 / 84.2              & 90.6 / 90.6               & 92.2 / 92.2                \\
\multicolumn{1}{c|}{}                           & \multicolumn{1}{l|}{MSP}   & 25.6 / \textbf{67.4}              & 15.5 / \textbf{86.3}               & 10.5 / \textbf{90.3}                 & 38.4 / \textbf{39.9}              & 47.3 / \textbf{48.2}              & 61.8 / \textbf{62.7}                \\
\multicolumn{1}{c|}{}                           & \multicolumn{1}{l|}{EBO}   & 35.1 / \textbf{68.4}              & 23.0 / \textbf{86.1}               & 18.5 / \textbf{90.2}                 & 43.0 / \textbf{44.0}              & 50.8 / \textbf{51.4}               & 66.8 / \textbf{68.1}                \\
\multicolumn{1}{c|}{}                           & \multicolumn{1}{l|}{ReAct} & 44.3 / \textbf{69.9}              & 43.4 / \textbf{86.6}               & 42.8 / \textbf{90.4}                 & 45.8 / \textbf{46.9}              & 52.2 / \textbf{52.9}               & 66.6 / \textbf{68.1}                \\
\multicolumn{1}{c|}{}                           & \multicolumn{1}{l|}{Gram}  & 21.5 / \textbf{65.2}              & 11.9 / \textbf{74.8}               & 6.7 / \textbf{38.8}                  & 50.6 / \textbf{52.1}              & 54.2 / \textbf{54.7}              & 64.3 / \textbf{65.1}               \\ \bottomrule
\end{tabular}
\label{tab:artitechtures}
\end{table*}

\begin{table*}[ht]
\centering
\setlength{\tabcolsep}{7.3pt}
%\vskip 5 pt
\caption{
AUC of OOD detection with and without CEA on models trained with the LogitNorm loss.
We use eICU and Diabetics as ID and synthesize the OOD data by scaling with a factor $\alpha$. Superior results are emphasized in bold unless the two are equal.}
\begin{tabular}{ccccccccc}
\toprule
\multirow{2}{*}{ID}   & \multirow{2}{*}{$\alpha$} & MDS       & KNN       & ViM       & MSP       & EBO       & ReAct     & Gram \\ \cmidrule{3-9} 
                           &                                        & \multicolumn{7}{c}{Baseline / Baseline\hskip0.75pt\&\hskip0.75pt CEA}                                                                \\ \midrule
\multirow{3}{*}{eICU}      & 10                                     & \textbf{72.8} / 72.6 & 57.7 / \textbf{68.2} & 45.9 / \textbf{59.2} & 55.0 / \textbf{67.3} & 55.0 / \textbf{67.3} & 53.5 / \textbf{65.8} & 37.1 / \textbf{60.1}                \\
                           & 100                                    & 85.6 / \textbf{85.7} & 63.9 / \textbf{82.4} & 43.6 / \textbf{75.4} & 61.7 / \textbf{82.7} & 61.7 / \textbf{82.7} & 55.1 / \textbf{80.4} & 21.7 / \textbf{62.6}                \\
                           & 1000                                   & 90.5 / 90.5 & 66.1 / \textbf{89.2} & 42.6 / \textbf{87.2} & 64.6 / \textbf{89.9} & 64.7 / \textbf{90.0} & 54.2 / \textbf{89.2} & 11.3 / \textbf{42.7}                \\ \midrule
\multirow{3}{*}{Diabetics} & 10                                     & 84.7 / 84.7 & 82.9 / \textbf{83.4} & 84.2 / 84.2 & 35.2 / \textbf{65.0} & 35.2 / \textbf{65.0} & 20.3 / \textbf{61.0} & 23.4 / \textbf{51.6}                \\
                           & 100                                    & 89.6 / 89.6 & 88.3 / \textbf{89.1} & 89.7 / 89.7 & 32.6 / \textbf{85.7} & 32.6 / \textbf{85.7} & 11.9 / \textbf{84.2} & 12.5 / \textbf{55.0}                \\
                           & 1000                                   & 91.7 / 91.7 & 90.5 / \textbf{91.3} & 91.9 / 91.9 & 31.7 / \textbf{90.0} & 31.9 / \textbf{90.1} & 9.1 / \textbf{89.1}  & 9.4 / \textbf{29.2}    \\ \bottomrule            
\end{tabular}
\label{tab:logitnorm}
\end{table*}


\begin{table*}[ht]
\centering
\setlength{\tabcolsep}{7.3pt}
\caption{AUC of OOD detection with and without CEA on image datasets. MNIST and CIFAR-10 serve as ID, and OOD sets are i) synthesized by scaling, ii) synthesized by an adversarial attack, or iii) selected from other datasets. ResNet-32 and ReLU MLP classifiers are used as the prediction models. Superior results are in bold unless the two are equal.}
\begin{tabular}{ccccc|ccc}
\toprule
\multicolumn{1}{l}{}                           &                            & \multicolumn{3}{c|}{ReLU MLP}                                                      & \multicolumn{3}{c}{ResNet-32}                                                     \\ \cmidrule{3-8} 
\multicolumn{1}{l}{}                           & \multicolumn{1}{c}{}       & \multicolumn{1}{c}{Scale} & \multicolumn{1}{c}{Attack} & \multicolumn{1}{c|}{Other} & \multicolumn{1}{c}{Scale} & \multicolumn{1}{c}{Attack} & \multicolumn{1}{c}{Other} \\ \cmidrule{3-8} 
ID                   & Method                     & \multicolumn{6}{c}{Baseline / Baseline\hskip0.75pt\&\hskip0.75pt CEA}                                                                                                                                      \\ \midrule
\multicolumn{1}{l|}{\multirow{7}{*}{MNIST}}    & \multicolumn{1}{l|}{MDS}   &  64.2 / \textbf{64.3} & \textbf{98.5} / 98.1 & 88.7 / \textbf{90.2}                  & 59.5 / 59.5                 & 99.9 / 99.9                & 99.9 / 99.9              \\
\multicolumn{1}{c|}{}                          & \multicolumn{1}{l|}{KNN}   &  62.1 / \textbf{62.4} & 73.1 / \textbf{84.6} & 97.6 / \textbf{98.2}                  & 54.6 / 54.6                 & 99.2 / \textbf{99.7}                  & 99.9 / 99.9              \\
\multicolumn{1}{c|}{}                          & \multicolumn{1}{l|}{ViM}   & 60.2 / 60.2 & 67.4 / \textbf{68.0} & 98.0 / 98.0                 & 58.3 / 58.3                 & 99.9 / 99.9                & 99.9 / 99.9              \\
\multicolumn{1}{c|}{}                          & \multicolumn{1}{l|}{MSP}   &  46.3 / \textbf{63.2} & 26.1 / \textbf{73.1} & 77.5 / \textbf{89.9}                  & 52.5 / \textbf{54.3}                 & 59.7 / \textbf{97.3}                  & 98.3 / \textbf{98.7}                \\
\multicolumn{1}{c|}{}                          & \multicolumn{1}{l|}{EBO}   &  41.3 / \textbf{61.3} & 7.0 / \textbf{44.2}  & 74.7 / \textbf{92.3}               & 47.5 / \textbf{67.7}                 & 11.7 / \textbf{92.8}                 & 95.5 / \textbf{97.0}                \\
\multicolumn{1}{c|}{}                          & \multicolumn{1}{l|}{ReAct} & 56.4 / \textbf{60.3} & 28.7 / \textbf{67.7} & 88.8 / \textbf{93.6}                 & 60.9 / \textbf{61.0}                 & 86.7 / \textbf{97.8}                  & 98.6 / \textbf{98.8}                \\
\multicolumn{1}{c|}{}                          & \multicolumn{1}{l|}{Gram}  & 34.1 / \textbf{48.9} & 2.8 / \textbf{12.8}  & 26.8 / \textbf{36.4}                 & 43.8 / \textbf{44.8}                 & 1.6 / \textbf{71.2}                   & 53.0 / \textbf{62.0}                \\ \midrule
\multicolumn{1}{l|}{\multirow{7}{*}{CIFAR-10}} & \multicolumn{1}{l|}{MDS}   & 97.9 / 97.9                 & 95.4 / \textbf{95.5}                  & \textbf{61.8} / 61.7                 & 99.8 / 99.8                 & 98.0 / \textbf{99.1}                  & \textbf{31.1} / 31.0                \\
\multicolumn{1}{c|}{}                          & \multicolumn{1}{l|}{KNN}   & 77.7 / \textbf{97.4}                 & 65.3 / \textbf{80.4}                  & 54.5 / \textbf{56.4}                 & 99.5 / \textbf{99.6}                 & 10.0 / \textbf{71.1}                  & 86.9 / \textbf{87.0}                \\
\multicolumn{1}{c|}{}                          & \multicolumn{1}{l|}{ViM}   & 71.2 / \textbf{95.2}                 & 43.3 / \textbf{50.5}                  & 63.5 / \textbf{64.3}                 & 70.6 / \textbf{95.6}                 & 0.0 / \textbf{58.6}                   & 90.4 / 90.4                \\
\multicolumn{1}{c|}{}                          & \multicolumn{1}{l|}{MSP}   & 14.3 / \textbf{90.0}                 & 11.0 / \textbf{39.0}                  & 57.3 / \textbf{59.1}                 & 88.0 / \textbf{99.3}                 & 0.2 / \textbf{74.4}                   & 88.4 / \textbf{88.7}                \\
\multicolumn{1}{c|}{}                          & \multicolumn{1}{l|}{EBO}   & 3.9 / \textbf{4.4}                   & 5.9 / \textbf{6.3}                    & 50.0 / 50.0                 & 78.0 / \textbf{96.7}                 & 0.0 / \textbf{59.0}                   & 90.4 / 90.4                \\
\multicolumn{1}{c|}{}                          & \multicolumn{1}{l|}{ReAct} & 71.6 / \textbf{94.4}                 & 72.2 / \textbf{81.3}                  & 55.5 / \textbf{57.7}                 & 97.5 / \textbf{99.0}                 & 13.0 / \textbf{73.4}                  & 81.8 / \textbf{82.0}                \\
\multicolumn{1}{c|}{}                          & \multicolumn{1}{l|}{Gram}  & 22.3 / \textbf{94.1}                 & 8.2 / \textbf{25.9}                   & 61.6 / \textbf{63.1}                 & 13.8 / 13.8                 & 0.0 / \textbf{1.6}                    & 70.4 / 70.4    \\ \bottomrule           
\end{tabular}
\label{tab:images}
\end{table*}

\subsection{Experimental Setup}\label{sec:experiment_setup}

To evaluate the proposed method against baseline OOD detection methods, we build on the theoretical findings described earlier, according to which a neural network is provably more confident on certain kinds of OOD inputs than on ID inputs. Accordingly, we adopt the assumptions used in Theorem 1 of \citet{ulmer2021know}
%. More specifically, we consider the prediction model and OOD groups 
as follows:
\begin{itemize}[itemsep=0pt,parsep=0pt, topsep=0pt, itemindent=0pt, leftmargin=10pt]
    \item The basic architecture of the prediction model is a ReLU classifier, i.e., a stack of fully connected layers with ReLU activations. This architecture is a piece-wise affine function, which leads to overconfident predictions as discussed.

    \item OOD instances are generated by scaling a single input variable of the ID data by a factor $\alpha$. We repeat this synthesis 50 times, each time scaling a different variable, and average the results to minimize the effect of the selected variable. If a dataset has fewer than 50 variables, we use each of them once. Moreover, we vary $\alpha$ to see how it affects the results. We are interested in such OODs because, for large $\alpha$, they lead to overconfident predictions according to the theory.
\end{itemize}
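To make the second assumption concrete, the synthesis can be written as follows. This is a sketch in notation introduced here for illustration only: $x$ denotes an ID instance, $j$ the index of the scaled variable, and $e_j$ the $j$-th standard basis vector.
\begin{equation*}
    \tilde{x} = x + (\alpha - 1)\, x_j\, e_j, \qquad \text{i.e.,} \quad \tilde{x}_i =
    \begin{cases}
        \alpha\, x_i & \text{if } i = j,\\
        x_i & \text{otherwise.}
    \end{cases}
\end{equation*}
For $\alpha = 1$ the instance is unchanged, and as $\alpha$ grows, $\tilde{x}$ moves away from $x$ along a single coordinate axis (provided $x_j \neq 0$).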

The experiments proceed as follows. First, a prediction model is trained on a specific dataset. Then, post-hoc OOD detection methods with and without our method are applied to the prediction model to examine whether they can discriminate OOD instances from IDs and the extent to which they suffer from overconfident predictions. 
AUC is used as the discrimination criterion, and the results are averaged over three repetitions with different random seeds to increase reliability.
We investigate more complicated architectures and some other OOD groups in the experiments as well, to see how the results extend to those scenarios. In the following, we provide details about these architectures and OODs along with the datasets and OOD detection baselines used in the experiments. 
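
For reference, the AUC admits a simple probabilistic reading, stated here as a sketch in notation introduced for illustration: $s(\cdot)$ denotes the novelty score of a detection method, and $x_{\mathrm{ID}}$ and $x_{\mathrm{OOD}}$ are drawn at random from the ID and OOD test sets, respectively.
\begin{equation*}
    \mathrm{AUC} = \Pr\bigl[s(x_{\mathrm{OOD}}) > s(x_{\mathrm{ID}})\bigr] + \tfrac{1}{2}\,\Pr\bigl[s(x_{\mathrm{OOD}}) = s(x_{\mathrm{ID}})\bigr] .
\end{equation*}
An AUC of $0.5$ (reported as $50$ in the tables) thus corresponds to chance, and a value below it means that OOD instances tend to receive \emph{lower} novelty scores than ID instances, which is the signature of overconfidence.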

\textbf{Tabular datasets:} We consider five tabular datasets: eICU \citep{pollard2018eicu}, MIMIC-IV \citep{johnson2023mimic}, Diabetic Retinopathy Debrecen \citep{balint2014diabetic}, Dry Bean \citep{UCI2020bean}, and Wine Quality \citep{Paulo2009wine}. These datasets are drawn from different domains and target diverse prediction tasks, such as mortality prediction or wine-type classification. In addition, they include large-scale and unbalanced datasets (e.g., eICU). Detailed information about the datasets can be found in Appendix \ref{apd:datasets}. For the first three datasets, we use $\alpha \in \{10, 100, 1000\}$ to synthesize the OOD instances, while opting for $\alpha \in \{2, 3, 4\}$ for the remaining ones, as they exhibit overconfident behavior at a smaller scale.

\textbf{Real-world tabular data as OOD:} Besides synthesized OOD groups, it is valuable to examine real-world OOD instances as well. Among the tabular datasets mentioned above, eICU and MIMIC-IV share many variables. Consequently, we can employ each of them as the OOD data for the other. For this purpose, we only use the variables shared by the two datasets. This experiment is also interesting in that it explores the effect of data homogeneity on overconfidence: as opposed to eICU, which comprises data from several hospitals, MIMIC-IV is more homogeneous since it sources data from a single center.

\textbf{Architectures:} In addition to the ReLU classifier, we also employ tabular ResNet and Transformer \citep{gorishniy2021revisiting} in the experiments. It has been empirically observed that tabular Transformers can mitigate the overconfidence phenomenon \citep{azizmalayeri2023unmasking}, which allows us to evaluate our method in combination with a prediction model that suffers from this issue only marginally.

\textbf{Image datasets:} In addition to tabular datasets, we also consider three widely used image datasets: MNIST \citep{deng2012mnist}, CIFAR-10, and CIFAR-100 \citep{krizhevsky2009learning}.
%, and ImageNet-1k \citep{deng2009imagenet}. 
We train a ReLU classifier and a ResNet-32 model on each of these datasets. For images, we only use $\alpha=10$ for the synthesized OODs, since it is enough to expose overconfidence. As another way of generating overconfident OOD instances, we also use adversarial examples obtained by maximizing the cross-entropy loss of the model itself via a PGD-20 attack with $\epsilon=32/255$ \citep{nguyen2015deep, madry2018towards}. Furthermore, we report the average detection performance on real-world OOD datasets, namely Fashion MNIST \citep{xiao2017fashion}, MNIST, SVHN \citep{Netzer2011SVHN}, CIFAR-10, and CIFAR-100 (the one used as ID is excluded from the average).
%For the first three low-resolution datasets, we consider Fashion MNIST \citep{xiao2017fashion}, MNIST, SVHN \citep{Netzer2011SVHN}, CIFAR-10, and CIFAR-100 as OOD (except the one that is used as ID), and for ImageNet, we use iNaturalist \citep{van2018inaturalist}, Textures \citep{cimpoi2014describing}, NINCO \citep{bitterwolf2023or}, and OpenImage-O \citep{wang2022vim}.
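
For reference, a PGD attack of this kind iterates the update below. This is a sketch of the standard formulation of \citet{madry2018towards}; the $\ell_\infty$ threat model, the step size $\eta$, and the symbols $f$, $\mathcal{L}$, and $\Pi$ are assumptions of this illustration rather than settings reported above.
\begin{equation*}
    x^{(t+1)} = \Pi_{\{x' :\, \lVert x' - x \rVert_\infty \le \epsilon\}}\Bigl(x^{(t)} + \eta\, \operatorname{sign}\bigl(\nabla_{x^{(t)}}\, \mathcal{L}(f(x^{(t)}), y)\bigr)\Bigr), \qquad t = 0, \ldots, 19,
\end{equation*}
where $x^{(0)}$ is the clean image $x$ (possibly with a random perturbation), $y$ is its label, $\mathcal{L}$ is the cross-entropy loss of the classifier $f$, and $\Pi$ is the projection onto the $\epsilon$-ball around $x$.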

%\textbf{Evaluation Criteria:} Following the studies on OOD detection, we use the Area Under the ROC Curve (AUC) as a standard measure for evaluating the performance in the binary classification of ID and OOD instances. AUC is in the range of [$0, 1$], with a higher value showing better performance.

\textbf{Baseline OOD detection methods:} A wide range of post-hoc OOD detection baselines are considered in this study following recent benchmarks in OOD detection \citep{yang2022openood, zhang2023openood, azizmalayeri2023unmasking}. These baselines include both commonly used and top-performing post-hoc OOD detection methods.
%that generate a novelty score based on the class probabilities or the internal representations of a prediction model. 
More information about these methods can be found in Appendix \ref{apd:baselines}. In addition, we also consider the same baselines applied to an MLP architecture with LogitNorm \citep{wei2022mitigating}. 
%as a training method aimed at mitigating overconfidence in the model by enforcing a constant norm on the logit vector.
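
For reference, LogitNorm replaces the standard cross-entropy loss with a cross-entropy computed on norm-constrained logits. The sketch below follows our reading of \citet{wei2022mitigating}; the symbols $f(x)$ for the logit vector, $y$ for the label, and $T$ for the temperature are introduced here for illustration (we write $T$ to avoid a clash with the threshold $\tau$ of our method).
\begin{equation*}
    \mathcal{L}_{\mathrm{LogitNorm}}(f(x), y) = -\log \frac{\exp\bigl(f_y(x) / (T\, \lVert f(x) \rVert_2)\bigr)}{\sum_{k=1}^{K} \exp\bigl(f_k(x) / (T\, \lVert f(x) \rVert_2)\bigr)} .
\end{equation*}
Because the logits are divided by their norm, the training objective cannot be reduced simply by inflating the logit magnitude.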

\section{Results}\label{sec:results}

%\vskip 4pt
%\MakeUppercase{\textbf{4.1 \ Evaluation with synthetic OOD}}
\paragraph{Evaluation with synthetic OOD}

The results of the experiments on tabular datasets are displayed in Fig.~\ref{fig:tabular} for the eICU and Diabetics datasets, and in Appendix \ref{apd:additional_results} for the rest. Our method improves most baselines; the two exceptions, MDS and SHE, are unaffected. Furthermore, the improvements are more notable for detection methods such as MSP and EBO, which rely on class probabilities to generate the novelty scores. 

In addition, OOD data generated with a larger scaling factor are expected to be detected more easily, as they lie farther from the ID data. However, certain baselines behave differently: their performance decreases as the scaling factor increases. This is a sign of overconfidence in the prediction model, as the detection method assigns a lower novelty score to the OOD instances that are farther from the ID data. As hoped, our method addresses this issue: combining the baselines with our method leads to consistently better performance for larger scaling factors.
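
The mechanism behind this behavior can be sketched for a piece-wise affine classifier. The notation is introduced here for illustration: $\tilde{x}$ is obtained from $x$ by scaling variable $j$ by $\alpha$, and we assume that for sufficiently large $\alpha$ the input stays in a single linear region, on which the logits are $f(\tilde{x}) = V\tilde{x} + c$. Any two logits then differ by
\begin{equation*}
    f_k(\tilde{x}) - f_l(\tilde{x}) = \alpha\, x_j\, (V_{kj} - V_{lj}) + \mathrm{const},
\end{equation*}
where the constant does not depend on $\alpha$. If a single class maximizes $x_j V_{kj}$, its gap to every other class grows linearly in $\alpha$, so that
\begin{equation*}
    \max_{k}\ \frac{\exp f_k(\tilde{x})}{\sum_{l} \exp f_l(\tilde{x})} \longrightarrow 1 \qquad \text{as } \alpha \to \infty .
\end{equation*}
A confidence-based novelty score such as that of MSP therefore approaches its lowest possible value, even though $\tilde{x}$ moves farther from the ID data.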

%\vskip 4pt
%\MakeUppercase{\textbf{4.2 \ Real Tabular Data as OOD}}
\paragraph{Real-world tabular data as OOD}

To assess the performance on real-world tabular OOD datasets, we consider MIMIC-IV as the OOD set for eICU and vice versa. Results of this experiment are illustrated in Fig.~\ref{fig:mimic_vs_eicu}. As with the synthesized OODs, our method significantly improves the OOD detection performance across several baselines without negatively affecting any of them. 
%Hence, our method works well within a real OOD set as well.

The results also indicate that our method yields larger improvements when the model is trained on MIMIC-IV as ID. This suggests that the prediction model trained on MIMIC-IV suffers more from overconfidence than the one trained on eICU (see Section~\ref{sec:discussion}).
% , probably due to the fact that eICU collects data from a variety of hospitals as observed in previous literature \citep{azizmalayeri2023unmasking}. 
Finally, it is worth noting that adding our method lifts every OOD detection result above the chance level of 0.5 AUC (often by a wide margin), `reversing' the effect of overconfidence. 

% \vskip 4pt
% \MakeUppercase{\textbf{4.3 \ Other architectures}}
\paragraph{Other architectures}

The architecture of the prediction model plays an important role in its overconfidence. Thus, we employ tabular ResNet and Transformer \citep{gorishniy2021revisiting} to evaluate our approach. Table \ref{tab:artitechtures} displays the results, demonstrating that adding CEA improves numerous baseline detection methods when applied to ResNet (due to space limitations, we show only some baselines and put the rest in Appendix \ref{apd:additional_results}). However, the improvements with the Transformer are marginal. This aligns with prior observations that the Transformer architecture mitigates the effect of overconfidence \citep{azizmalayeri2023unmasking}.

It is noteworthy that the pure baseline methods detect OOD instances better on average with the Transformer model, as it internally addresses overconfidence. However, ResNet plus CEA often outperforms the Transformer plus CEA (especially for MSP, EBO, and ReAct). Hence, while changing the architecture of the prediction model can itself be a solution to overconfidence, its capability for OOD detection still depends heavily on the way OODs are singled out.

%\vskip 4pt
%\MakeUppercase{\textbf{4.4 \ LogitNorm Training}}
\paragraph{LogitNorm training}

%LogitNorm \citep{wei2022mitigating} is a training loss that mitigates the overconfidence in the prediction model. 
To assess the impact of LogitNorm training, we also train prediction models with this loss instead of softmax cross-entropy loss. Results are provided in Table \ref{tab:logitnorm}. According to this table, our method still manages to improve on the models trained with this dedicated loss across different datasets and baselines. 

Comparing this table with Fig.~\ref{fig:tabular} indicates that LogitNorm itself leads to better OOD detection, as expected. Nevertheless, it does not eliminate overconfidence in the prediction model; e.g., OOD detection using LogitNorm plus MSP (or EBO, ReAct, or Gram) on Diabetics performs worse than a random binary classifier.

% \vskip 4pt
% \MakeUppercase{\textbf{4.5 \ Extension to images}}
\paragraph{Extension to images}

Here, we evaluate the OOD detection performance within the image domain. Results for the MNIST and CIFAR-10 datasets are presented in Table \ref{tab:images}, and additional results for CIFAR-100 can be found in Appendix \ref{apd:additional_results}. For synthesized OOD sets, our method significantly improves the performance of many baselines regardless of the choice of prediction model and dataset. Furthermore, it is similarly effective with real-world OOD sets when applied to the ReLU MLP. Nevertheless, the improvements become marginal with the ResNet architecture, to the extent that, on heterogeneous ID data such as CIFAR-100, results with and without CEA are the same for real-world OOD sets.

\paragraph{Ablation study on hyperparameters}
The main hyperparameters of our method are $\tau$ and $\lambda$, regulated via $p$ and $\gamma$, respectively. Here, we examine the effects of these parameters. To this end, we evaluate the OOD detection performance on the Diabetics dataset across various values of $p$ and $\gamma$ in Fig.~\ref{fig:parameters}. This figure demonstrates that the proposed method can improve OOD detection for a wide range of choices for these parameters. Nevertheless, it is advisable to fine-tune these parameters to achieve optimal results. Note that we have used the fixed set of parameters described in Section~\ref{sec:overconfidence_measure}. Therefore, the outcomes of our method could potentially be enhanced by identifying the optimal hyperparameters for each detection method and dataset.

\begin{figure*}[ht]
    \centering
        \includegraphics[width=0.9\textwidth]{images/parameters.png}
        \caption{Impact of parameters on the performance of CEA applied to different baseline OOD detection methods on the Diabetics dataset. (a) $\gamma=10$ and $p$ is varied. (b) $p=99.9$ and $\gamma$ is varied. The dashed lines indicate the performance of OOD detection methods without CEA ($\gamma=0$).}
    \label{fig:parameters}
\end{figure*}

Based on Fig. \ref{fig:parameters}, a higher value of $p$ guarantees that CEA only positively influences the performance of baseline OOD detection methods. However, it might reduce the capability to detect overconfident OOD instances, as it raises the threshold $\tau$.

To investigate $\gamma$, we have set $p$ such that the resulting threshold remains above the node activation values of ID data. Consequently, our method only captures the overconfidence in OOD instances, and increasing $\gamma$ results in better performance. However, we do not recommend using a large $\gamma$ if the threshold has not been set carefully. In addition, note that too small a $\gamma$ can render the effect of CEA on the final novelty score negligible.
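
Schematically, the roles of the two hyperparameters can be summarized as below. This is only an illustrative sketch and not a restatement of Section~\ref{sec:overconfidence_measure}, whose definitions take precedence: $s$ denotes the baseline novelty score, while $s_{\mathrm{CEA}}$ for the adjusted score and $g_\tau$ for the added term are symbols introduced here.
\begin{equation*}
    s_{\mathrm{CEA}}(x) = s(x) + \lambda\, g_\tau(x), \qquad g_\tau(x) = 0 \ \text{ whenever the activations of } x \text{ stay below } \tau .
\end{equation*}
Raising $p$ raises $\tau$, so the added term is triggered by fewer inputs, while $\gamma = 0$ switches the added term off and recovers the baseline.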

\section{Discussion}\label{sec:discussion}
%A common assumption in OOD detection is that prediction models are more uncertain about OOD inputs. However, ML models can provide overconfident predictions, which in turn can reduce the performance of many OOD detection methods. 

As a solution to the overconfidence issue in OOD detection, we proposed CEA, a way to adjust the novelty scores of the post-hoc OOD detection methods by adding a new term to the original novelty score. 
%that captures overconfidence via activations of hidden units. 
CEA is only activated when an OOD input results in an outsized activation compared to the corresponding values over the ID validation set. 
%[[WHAT HAPPENS IF IN ID THERE ARE ALREADY EXTREME PROBABILITIES? THE THRESHOLD WILL HAVE TO BE VERY HIGH AND THEN THE DETECTION OF OOD BECOMES PROBLEMATIC, IS THIS TRUE?]]

To demonstrate the effectiveness of CEA, we conducted experiments on 16 different baseline OOD detection methods across 5 different tabular datasets spanning a wide range of domains in a context where it has been theoretically verified that overconfidence hurts OOD detection. We also explored alternative settings with real-world OOD sets, other architectures like tabular ResNet and Transformer, and image datasets. There was a significant improvement in the performance of numerous baselines across these settings; however, there were also methods and settings that were not affected much, which are discussed below.

%To demonstrate the effectiveness of the proposed method, we utilized the theoretical results showing that a ReLU classifier makes overconfident predictions for OOD instances synthesized by scaling a single variable from ID instances. Based on this setting, we trained prediction models on 5 different datasets spanning a wide range of domains. Next, we performed OOD detection within these prediction models and applied our method to 16 different baseline OOD detection methods. The results show significant improvement in the performance of numerous baselines across different datasets.

Our method enhances baseline detection methods that rely on class probabilities to generate novelty scores, such as MSP, EBO, and DICE, more than those that measure a distance in the feature space, such as MDS, SHE, and KLM. This is because distance-based methods inherently handle overconfidence by relying on distance (measured among internal representations) instead of confidence. Consequently, CEA may not substantially improve the performance of these kinds of detection methods, as can be observed for MDS in the results. One may be tempted to resort to these approaches instead, but we cannot rely solely on distance-based detection methods, as they may not perform well in general; see, e.g., MDS applied to the ResNet-32 model trained on the CIFAR-10 dataset. More specifically, another baseline combined with CEA may perform better than methods like MDS. Hence, CEA allows us to replace methods like MDS with other baselines while keeping the benefits of those methods. 

Among the architectures evaluated in our study, the Transformer seems to be more robust against overconfidence. This behavior can be explained based on the theoretical understanding of the problem. The existing proofs of overconfidence assume that the model is a piece-wise affine classifier \citep{hein2019relu}. The Transformer, however, uses activation functions and an attention mechanism \citep{vaswani2017attention} that are non-linear, violating this assumption. 
%We note that while non-linearity can be a solution to overconfidence, it can reduce the capability of OOD detection as well.
This property results in better OOD detection performance within detection methods such as MSP and EBO; still, when overconfidence is addressed by our proposed method or methods such as MDS, we see that the non-linearity of Transformer is not so beneficial for OOD detection anymore.


%We also evaluated a more real-world setting in which eICU and MIMIC-IV datasets are considered as the OOD set for the other one. This setting demonstrates significant improvement in the performance of the baseline method when applying our method as well.

%In addition, we assessed two alternative architectures for the prediction model to explore how the results extend to other model structures. In tabular ResNet, our method significantly outperforms many baseline OOD detection methods, but improvements in Transformer are marginal. This is consistent with the observation in previous studies that Transformers alleviate overconfidence. Nevertheless, this alleviation comes at the cost of a compromised OOD detection performance.

%We extended our experiments to the image datasets like CIFAR-10 as well. Our method demonstrates improvements across both real and synthesized OOD data when applied to the simple ReLU classifier. Even within a more complicated architecture like ResNet, our method still outperforms several baselines, but the improvements are marginal on real OOD sets.  

The results also showcase that more heterogeneous ID data reduces overconfidence. For example, models trained on eICU and CIFAR-100 are not overconfident in real-world OOD sets as much as models trained on MIMIC-IV and CIFAR-10, respectively. This may explain the observation in the OpenOOD and other benchmarks \citep{yang2022openood, azizmalayeri2023unmasking} that OOD detection performance is better in some models trained on complex datasets. We also note that model calibration may be another way to mitigate overconfidence; however, our results in Appendix \ref{apd:calibration} demonstrate that calibration improves the OOD detection AUC only marginally.

Lastly, note that the proposed method can be seamlessly incorporated as an extension to any post-hoc OOD detection method, without much computational overhead but with potentially big gains in real-time OOD detection performance. Additionally, it is compatible with other methods proposed for overconfidence such as LogitNorm and ReAct. This property makes this method suitable for many applications, especially those with a high risk of data shift and safety-critical consequences. Even though the applicability of our method may seem impaired by the need to choose hyperparameters, the ablation study demonstrates that the conclusions about CEA are robust within reasonable hyperparameter ranges. 
%[[I ENCOURAGE YOU TO THINK WHWTHER IT IS POSSIBLE OR MAKES SENSE TO TRY TO "INDIVIDUALIZE" THE THRESHOLD: THE THRESHOLD FOR AN INSTANCE COULD BE INFORMED BY THE INSTANCE ITSELF.]]
%as it enhances the robustness of the OOD detection methods against overconfidence.

%Additionally, the experiments indicate that our method works well within a wide range of hyperparameters. Also, it can be combined with other approaches provided for mitigating overconfidence such as LogitNorm.

%To sum up, results demonstrate that our approach can be seamlessly incorporated as an extension into any post-hoc OOD detection method within a prediction model, with minimal risk of compromising detection performance. Furthermore, the improvements are significant in the tabular datasets even in the real-world case study. 
In summary, we believe that the proposed method can increase the reliability of OOD detection methods and benefit a wide range of domains that currently use ML models and OOD detection such as healthcare (e.g., disease recognition or mortality prediction), financial services (e.g., fraud detection), transportation (e.g., autonomous vehicles), and cybersecurity (e.g., identification of OOD network patterns).
Our study not only offers a practical solution but also provides insights that open the door to research exploring alternative solutions to overconfidence in OOD detection. On the experimental side, future work can also consider the application of CEA within alternative domains, including but not limited to time-series and text data, to enrich the understanding of the problem.
On the theoretical side, it might be worth investigating which properties of CEA (or the term $g$ more generally) are sufficient to guarantee the absence of overconfidence in OOD detection.
% Furthermore, the theoretical examination of overconfidence and its influence on OOD detection could also be further investigated in broader contexts.


\begin{acknowledgements} % will be removed in pdf for initial submission,
						 % (without ‘accepted’ option in \documentclass)
                         % so you can already fill it to test with the
                         % ‘accepted’ class option
This study was part of the project \textit{Research in clinical prediction models and natural language processing with deep learning} (project number NWO-2021.024) of the research program Computing Time on National Computer Facilities. The computational resources used were financed by the Dutch Research Council (NWO).

\end{acknowledgements}

% References
\bibliography{uai2024-template}

\newpage

\onecolumn

\title{Mitigating Overconfidence in Out-of-Distribution Detection by Capturing Extreme Activations \\(Supplementary Material)}
\maketitle
\vspace{3pt}
%This Supplementary Material should be submitted together with the main paper.

\appendix

\section{Baseline Post-hoc OOD Detection Methods}\label{apd:baselines}

In this section, we provide a summary of baseline OOD detection methods included in our experiments. We have selected these methods following the recent related benchmarks \citep{yang2022openood, zhang2023openood, azizmalayeri2023unmasking}, and detailed information about them can be found in the code and original studies.

\textbf{MDS} \citep{lee2018simple}: This method fits a class-conditional Gaussian distribution $\mathcal{N}(\mu_k, \Sigma)$ to the feature vector before the logits. The covariance matrix $\Sigma$ is shared between the classes, but the mean $\mu_k$ is computed separately for each class $k\in\{1, 2, \ldots, K\}$. The novelty score for an input $x$ with pre-logit features $h_x$ is the squared Mahalanobis distance to the closest class mean:
\begin{equation}
    \min_{k}\ \ (h_x-\mu_k)^T\Sigma^{-1}(h_x-\mu_k) .
\end{equation}
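As a sketch of how these parameters are typically obtained (using the standard empirical estimates; $N$, $N_k$, and the training pairs $(x_i, y_i)$ are notation introduced here for illustration), the class means and the shared covariance can be estimated as:
\begin{equation}
    \mu_k = \frac{1}{N_k}\sum_{i:\,y_i=k} h_{x_i}, \qquad
    \Sigma = \frac{1}{N}\sum_{k=1}^{K}\ \sum_{i:\,y_i=k} (h_{x_i}-\mu_k)(h_{x_i}-\mu_k)^T ,
\end{equation}
where $N_k$ is the number of training samples in class $k$ and $N=\sum_{k} N_k$.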

\textbf{RMDS} \citep{ren2021simple}: Motivated by the observation that MDS does not work well on near-OOD data, \citet{ren2021simple} suggest fitting a single, class-independent distribution $\mathcal{N}(\mu_0, \Sigma_0)$ to the feature vector before the logits and normalizing the distances measured by MDS as:
\begin{equation}
    \mathrm{MDS}(h_x)-(h_x-\mu_0)^T\Sigma_0^{-1}(h_x-\mu_0) .
\end{equation}
The authors argue that this correction makes MDS more robust on near-OOD sets.

\textbf{KNN} \citep{sun2022out}: This method uses a non-parametric nearest-neighbor distance for OOD detection. More specifically, the novelty score of an input is the distance to its $k$-th nearest neighbor in the training set. The distance is computed based on the embedding extracted from each input.

\textbf{ViM} \citep{wang2022vim}: This study suggests that softmax probabilities and features should be used simultaneously in order to detect different types of OOD data. Accordingly, the method combines a class-agnostic score from the feature space with the class-dependent scores from the logits. More specifically, it is based on the idea that some information from the feature space is not carried to the logits. This information is recovered from the residual of the features with respect to their principal subspace and added to the score from the logits.

\textbf{SHE} \citep{zhang2022out}: This method shares its intuition with MDS. SHE stores a single pattern per class, computed from the penultimate layer of the neural network over the training set. Afterward, it leverages the energy function defined for modern Hopfield networks \citep{ramsauer2020hopfield} to measure the distance between a new input pattern and the stored patterns.

\textbf{KLM} \citep{hendrycks2022scaling}: For each class, this method averages the probability vectors of the validation samples that the prediction model assigns to that class. The novelty score for an input is then computed as the minimum KL divergence between its probability vector and these class-dependent average probability vectors.

\textbf{OpenMax} \citep{bendale2016towards}: 
On a dataset with $k$ classes, they propose to change the softmax probability such that it generates a probability vector for $k+1$ classes, where the last class corresponds to the open-set class. To achieve this, they reweight the original probability vector by fitting a Weibull distribution to the class-dependent probabilities.

\textbf{MSP} \citep{hendrycks2017a}: This is a simple but effective baseline proposed for OOD detection. It is motivated by the intuition that the maximum softmax value for an OOD input should not be as large as that for ID data. It therefore uses the maximum softmax probability to compute the novelty score.

\textbf{MLS} \citep{hendrycks2022scaling}: As an alternative to the maximum softmax probability, MLS suggests using the maximum logit. Experiments on MLS have demonstrated that it performs better than MSP in large-scale multi-class, multi-label, and segmentation tasks.
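Assuming that $z(x)\in R^K$ denotes the logit vector of the network for an input $x$ (notation introduced here for illustration only), the confidence values underlying MSP and MLS can be sketched as:
\begin{equation}
    \mathrm{MSP}(x) = \max_{k}\ \frac{e^{z_k(x)}}{\sum_{j=1}^{K} e^{z_j(x)}}, \qquad
    \mathrm{MLS}(x) = \max_{k}\ z_k(x) .
\end{equation}
Both quantities are expected to be larger for ID inputs, so a novelty score can be obtained, for example, by negating them.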

\textbf{TempScaling} \citep{guo2017calibration}: This method calibrates the softmax temperature over a validation set before applying MSP for OOD detection.

\textbf{EBO} \citep{liu2020energy}: The intuition behind EBO is that $p(y|x)$ used in methods such as MSP should be replaced with $p(x)$, which better indicates whether an input $x$ comes from the training distribution. For this purpose, the authors propose an energy-based framework for OOD detection that computes novelty scores based on the energy score.
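For reference, with $z(x)\in R^K$ denoting the logits and $T$ a temperature parameter (both symbols are assumed here for illustration), the energy score of \citet{liu2020energy} is commonly written as:
\begin{equation}
    E(x) = -T \log \sum_{k=1}^{K} e^{z_k(x)/T} ,
\end{equation}
where larger energy values are expected for OOD inputs.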

\textbf{GRAM} \citep{sastry2020detecting}: This method characterizes the intermediate representations of the neural network by Gram matrices. OOD inputs are identified by comparing the values in the Gram matrices to their respective ranges computed over the training set.

\textbf{GradNorm} \citep{huang2021importance}: The key idea in GradNorm is that the magnitude of the gradient back-propagated from the KL divergence between the softmax vector and a uniform probability vector is larger for ID data than for OOD data. This makes sense, as OOD inputs are generally expected to yield a nearly uniform probability vector.
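A sketch of this score, assuming that $z(x)$ denotes the logits, $u$ the uniform distribution over the classes, and $w$ the weights with respect to which the gradient is taken (symbols introduced here; \citet{huang2021importance} report using the last-layer weights and the $L_1$ norm), is:
\begin{equation}
    S(x) = \left\lVert \frac{\partial\, D_{\mathrm{KL}}\big(u \,\|\, \sigma(z(x))\big)}{\partial w} \right\rVert_1 ,
\end{equation}
so that a larger $S(x)$ points towards an ID input.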

\textbf{ReAct} \citep{sun2021react}: This method clips the activations at the penultimate layer of the neural network to an upper limit computed over a validation set. This helps to reduce the impact of overconfidence on the generated novelty score. The authors suggest applying EBO after rectification, but it can be combined with other detection methods as well.
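In symbols, writing $h_x$ for the penultimate-layer features of an input $x$ and $\lambda$ for the upper limit (a symbol introduced here for illustration), the rectification can be sketched as the element-wise operation
\begin{equation}
    \bar{h}_x = \min(h_x, \lambda) ,
\end{equation}
after which the logits and the chosen novelty score (e.g., EBO) are computed from $\bar{h}_x$ instead of $h_x$.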

\textbf{DICE} \citep{sun2022dice}: The idea of DICE is that the reliance of neural networks on unimportant weights and units can reduce OOD detection performance. Accordingly, DICE proposes to rank weights based on a contribution measure and to use only the most contributing ones in OOD detection. A simple example of the contribution measure is averaging the output of each weight over a validation set.

\textbf{ASH} \citep{djurisic2022extremely}: This study extends the neural network sparsification idea and proposes to remove a large proportion of an input's activations and lightly adjust the rest. The change in the activations is specific to each input and does not require any statistics from the training set.


\section{Dataset and Task Details}\label{apd:datasets}
In this section, we present information about the datasets and the associated prediction tasks for which they are employed. These datasets are publicly available (some need access authorization).

\subsection{Tabular}

\textbf{eICU:} The eICU Collaborative Research Database is a dataset containing health data from patients admitted to ICUs in the United States in 2014--2015. This dataset can be accessed through PhysioNet\footnote{\url{https://physionet.org/content/eicu-crd/2.0/}}, but access requires being a credentialed user on the website. For pre-processing this dataset, we followed the guidelines in prior works \citep{ulmer2020trust, azizmalayeri2023unmasking}. More specifically, we employed the pipeline provided in \citet{sheikhalishahi2020benchmarking}\footnote{\url{https://github.com/mostafaalishahi/eICU_Benchmark_updated}} to keep patients with a length of stay of at least 48 hours, age greater than 18, and known discharge status. Since some of the variables are not available for some patients, these works suggest a list of more frequently recorded variables, provided in Table \ref{tab:variables}, to be used in the analysis. Patients lacking data for any of these variables are excluded from the dataset, which resulted in 54826 unique patients. Moreover, the time-dependent variables are aggregated through 6 different statistics (minimum, maximum, mean, standard deviation, skewness, and number of observations) computed over windows consisting of the full time series and its first and last 10\%, 25\%, and 50\%. This dataset is then used for the mortality prediction task, where the data collected from patients in the first 48 hours is used to predict in-hospital mortality. The mortality rate in this dataset is 6.77\%.

\textbf{MIMIC-IV:} The Medical Information Mart for Intensive Care (MIMIC) dataset provides critical care data for patients admitted to the ICU at the Beth Israel Deaconess Medical Center. This dataset is accessible to credentialed users on the PhysioNet website\footnote{\url{https://physionet.org/content/mimiciv/2.2/}}. It is pre-processed similarly to the eICU dataset. Initially, the data undergoes the pre-processing pipeline presented in \citet{gupta2022extensive}\footnote{\url{https://github.com/healthylaife/MIMIC-IV-Data-Pipeline}}, followed by filtering procedures similar to those employed in eICU. This resulted in 18180 unique patients. The dataset is used for mortality prediction as in eICU and has a mortality rate of 12.57\%.

\begin{table}[t]
\centering
\begin{tabular}{lll}
\toprule
Description & eICU                  & MIMIC-IV                 \\ \midrule
\multicolumn{2}{l}{\textbullet{} \textit{Time-dependent}}             \vspace{3pt}  \\
Blood pH value & pH                    & pH                       \\
Body temperature & Temperature (c)        & Temperature              \\
Respiratory rate & Respiratory Rate      & Respiratory rate         \\
Blood oxygen saturation & O2 Saturation         & Oxygen saturation        \\
Mean arterial pressure & MAP (mmHg)            & Mean blood pressure      \\
Heart rate & Heart Rate            & Heart Rate               \\
Blood glucose level & glucose               & Glucose                  \\
Glasgow coma scale (total) & GCS Total             & -                \\
Glasgow coma scale (motor functions) & Motor                 & -                 \\
Glasgow coma scale (eyes) & Eyes                  & -                \\
Glasgow coma scale (verbal) & Verbal                & -                 \\
Fraction of inspired oxygen & FiO2                  & -               \\
Diastolic blood pressure & Invasive BP Diastolic & Diastolic blood pressure \\
Systolic blood pressure & Invasive BP Systolic  & Systolic blood pressure  \\ \midrule
\multicolumn{2}{l}{\textbullet{} \textit{Time-independent}}           \vspace{3pt}  \\
Gender & gender                & gender                   \\
Age & age                   & age                      \\
Ethnicity & ethnicity             & -               \\
Height at admission time & admissionheight       & -                \\
Weight at admission time & admissionweight       & -               \\
Admission type & -              & admission\_type          \\
First care unit & -             & first\_careunit     \\ \bottomrule    
\end{tabular}
\caption{Clinical variables used for the eICU and MIMIC-IV datasets. A dash indicates that the variable is not included in the dataset.}
\label{tab:variables}
\end{table}

\textbf{Diabetic Retinopathy Debrecen:} Diabetic retinopathy is a complication of diabetes that affects the eyes by damaging the blood vessels of the light-sensitive tissue. To facilitate research on diabetic retinopathy, Messidor \citep{decenciere2014feedback}, a collection of diabetic retinopathy examinations each consisting of two macula-centered eye fundus images, was made available. Diabetic Retinopathy Debrecen is a dataset containing features extracted from the Messidor images to predict whether an image has signs of diabetic retinopathy. This dataset contains 1151 instances and can be accessed through the UCI machine learning repository\footnote{\url{https://archive.ics.uci.edu/dataset/329/diabetic+retinopathy+debrecen}}. The dataset is roughly balanced, with 53.1\% positive labels.

\textbf{Dry Bean:} This dataset contains features from 7 different registered varieties of dry beans. It can be accessed through the UCI machine learning repository\footnote{\url{https://archive.ics.uci.edu/dataset/602/dry+bean+dataset}} and contains 13611 instances.
The task in this dataset involves distinguishing various types of dry beans.

\textbf{Wine Quality:} This dataset combines data from the red and white vinho verde wine samples from the north of Portugal. It is publicly available through the UCI machine learning repository\footnote{\url{https://archive.ics.uci.edu/dataset/186/wine+quality}} and contains 4898 instances in total. This dataset is designed to predict the quality of wine based on physicochemical tests but can be used for color classification as well. In this study, we leverage it for the latter purpose.


\subsection{Images}

\textbf{MNIST:} The MNIST dataset is a collection of handwritten digits. The images in this dataset are grayscale and have a shape of 28$\times$28. It has a training and test set including 60,000 and 10,000 instances, respectively. The prediction task in this dataset involves classifying the digits.

\textbf{Fashion-MNIST:} This dataset contains images of Zalando's articles, consisting of 60,000 images in the training set and 10,000 in the test set. The image size is 28$\times$28.

\textbf{SVHN:} This dataset contains 600,000 images of digits used in house numbers. The images are 32$\times$32 and colored.

\textbf{CIFAR-10:} The CIFAR-10 dataset consists of 10 classes of colored 32$\times$32 images. The training and test sets contain 50,000 and 10,000 instances, respectively.

\textbf{CIFAR-100:} The CIFAR-100 dataset consists of 100 classes of colored 32$\times$32 images. 
Each class in the training and test set contains 500 and 100 images, respectively, which makes it as large as the CIFAR-10 dataset.

%\textbf{ImageNet-1k:} This dataset consists of 1000 object classes with high-resolution images. The images are commonly converted to 224$\times$224 images through cropping and resizing transformations. The training and validation set in this dataset contain 1,281,167 and 50,000 images, respectively. 

\section{Proof of Theorem 1}\label{apd:proof}

Before we come to Theorem 1, we first present a lemma needed in the proof.
\begin{lemma}
Let $x\in R^D$ such that, for a given class $c$ in the output of the softmax function $\sigma$, the following limit holds:
\begin{equation}\label{eq:lemma}
    \lim_{{x_d \to \infty}} \sigma(f(x))_c = 1,
\end{equation}
where $f$ denotes an arbitrary function. Then, we can infer that:
\begin{equation}\label{eq:lemma_result}
\exists\ c', \lim_{x_d \to \infty} f(x)_{c'} =\infty.
\end{equation}
\end{lemma}
\begin{proof}
Given the continuity of the softmax, we can rewrite the limit of the composition of the two functions as 
\begin{equation}
    \sigma[\lim_{{x_d \to \infty}} f(x)]_c = 1.
\end{equation}
Unfolding the definition of the softmax and moving the denominator across the equality, we can conclude that 
\begin{equation}
    e^{\lim_{x_d \to \infty} [f(x)]_c} = \sum_{c'=1}^{|C|} e^{\lim_{x_d \to \infty} [f(x)]_{c'}}.
\end{equation}
% This entails that 
% \begin{equation}
%     0 = \sum_{c'=1, c'\neq c}^{|C|} e^{[\lim_{x_d \to \infty} f(x)]_{c'}},
% \end{equation}
Since the exponential function is strictly positive, the previous equation cannot hold if, for all class indices $c'$, $\lim_{{x_d \to \infty}} [f(x)]_{c'}$ is a finite number. Hence, for at least one index, said limit must equal (plus or minus) infinity.

% To prove this lemma, we first consider the softmax function's definition:
% \begin{equation}\label{eq:softmax_def}
%    \sigma(f(x))_c = \frac{e^{f(x)_c}}{\sum_{j=1}^{n} e^{f(x)_j}} \stackrel{Eq. \ref{eq:lemma}}{\implies} \lim_{x_d \to \infty} e^{f(x)_c} \to \lim_{x_d \to \infty} \sum_{j=1}^{n} e^{f(x)_j}.
% \end{equation}
% If $\lim\limits_{x_d \to \infty} f(x)_c \to \infty$, the lemma is trivially true. Otherwise, from Equation \ref{eq:softmax_def}, we deduce:
% \begin{equation}
%     \lim_{x_d \to \infty} \sum_{\substack{j=1 \ j \neq c}}^{n} e^{f(x)_j} \to 0\ 
%  \ \stackrel{e^{f(x)_j}\geq0}{\implies} \ \ \forall j \in {1, \ldots, n} \setminus {c}: \lim_{x_d \to \infty} f(x)_j \to -\infty,
% \end{equation}
% which validates the lemma.
\end{proof} 
\setcounter{theorem}{0}
\begin{theorem}
Let $x\in R^D$ and suppose $\alpha$ is a scaling vector. Now $x'= \alpha \odot x$ can be considered as an OOD example if $\alpha$ is large enough. Let $h_\theta$ be any neural network whose last layer is linear, generating an overconfident prediction for class $c$ on $x'$ in a $C$-class classification problem as:
\begin{equation}\label{eq:oc}
    \lim_{{\alpha_d \to \infty}} {\sigma(h_\theta (x'))}_c = 1,
\end{equation}
where $\sigma$ is the softmax function. Then, 
there exists at least one dimension in which the output of the penultimate layer goes to infinity in the limit:
\begin{equation}
    \exists\ k, \lim_{\alpha_d \to \infty} {(x_{R-1}')}_k = \infty,
\end{equation}
where $x_{R-1}' \in R^{D'}$ is the output of the penultimate layer.
%there is at least one value at the output of the penultimate layer that tends to infinity.
\end{theorem}

\begin{proof}
%Suppose $x_{R-1}'$ is the output of the penultimate layer, and 
Let $w_R \in R^{C\times D'}$  and $b_R \in R^{C}$ denote the weights and biases at the last linear layer of the neural network $h_\theta$. Then, $h_\theta(x')$ can be formulated as:
\begin{equation}\label{eq:linear}
    h_\theta (x') = w_R x_{R-1}' + b_R.
\end{equation}
Furthermore, by Lemma 1, Equation \ref{eq:oc} implies that:
\begin{equation}\label{eq:infty}
\exists\ c', \lim_{\alpha_d \to \infty} {h_\theta (x')}_{c'} = \infty.
\end{equation}
Now, substituting Equation \ref{eq:linear} into Equation \ref{eq:infty}, we have:
\begin{equation}\label{eq:pen}
\exists\ c', \lim_{\alpha_d \to \infty} {(w_R x'_{R-1} + b_R)}_{c'} = \infty,
\end{equation}
where $b_R$ is a constant vector that does not depend on $\alpha$ and can therefore be disregarded in this limit. So, we rewrite the matrix multiplication for index $c'$ in Equation \ref{eq:pen} as:
\begin{equation}
\exists\ c', \lim_{\alpha_d \to \infty} \sum_{k=1}^{D'} {{({w_R})}_{c',k}\ {{({x}_{R-1}'})}_k} = \infty, %\implies \exists\ k, c', \lim_{\alpha_d \to \infty} {{({w_R})}_{c',k}\ {{({x}_{R-1}'})}_k} \to \infty.
\end{equation}
where $D'$ is the dimensionality of the output of the penultimate layer. This entails that at least one of the terms of the sum must tend to infinity. Since ${({w_R})}_{c',k}$ is a fixed scalar value that does not depend on $\alpha$, we deduce that:
\begin{equation}
\exists\ k, \lim_{\alpha_d \to \infty} {{{({x}_{R-1}'})}_k} = \infty.
\end{equation}
This means that the feature vector at the penultimate layer consists of at least one value that goes to infinity in the limit, completing the proof.
\end{proof}
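
As a minimal illustration of Theorem 1 (a toy construction introduced here, not one of the models used in the experiments), consider $D=1$, an input $x>0$, a penultimate layer consisting of a single ReLU unit, so that $x'_{R-1} = \max(0, \alpha x) = \alpha x$, and a two-class linear last layer with $w_R = (1, -1)^T$ and $b_R = (0, 0)^T$. Then $h_\theta(x') = (\alpha x, -\alpha x)^T$ and
\begin{equation}
    \sigma(h_\theta(x'))_1 = \frac{e^{\alpha x}}{e^{\alpha x} + e^{-\alpha x}} = \frac{1}{1 + e^{-2\alpha x}} \xrightarrow[\alpha \to \infty]{} 1 ,
\end{equation}
so the prediction becomes overconfident, while the penultimate activation $\alpha x$ grows without bound, in line with the theorem.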

\section{Additional results}\label{apd:additional_results}
This section includes some extra results for the experiments discussed in the main text.

\subsection{Tabular data}
Results for three other tabular datasets are illustrated in Fig. \ref{fig:tabular_apd}; they align with the results for the tabular datasets discussed in the main text.

\begin{figure}[ht]
\centering
\begin{subfigure}{}
\centering
     \includegraphics[width=\linewidth]{images/main_mimic.png}
\end{subfigure}
\\
\begin{subfigure}{}
\centering
\vskip -10pt
         \includegraphics[width=\linewidth]{images/main_bean.png}
\end{subfigure}
\\
\begin{subfigure}{}
\centering
\vskip -10pt
        \includegraphics[width=\linewidth]{images/main_wine.png}
\end{subfigure}
\caption{OOD detection performance with and without CEA using the MIMIC-IV (top), Dry Bean (middle), and Wine Quality (bottom) datasets as ID and OOD data synthesized by scaling. The blue bars are positioned in front of the red ones and cross markers are employed to emphasize the top of the red bars. The scaling factors and baseline names are presented under each bar.}
    \label{fig:tabular_apd}
\end{figure}

\subsection{Architectures and LogitNorm}

In the results presented for other types of architectures and LogitNorm training, we only included some of the baseline detection methods to stay within the page limit. 
%We presented results for some of the more relevant baselines in the experiments with other architectures and LogitNorm training. 
Results for the other detection methods are displayed in Tables \ref{tab:artitechtures_extra_results} and \ref{tab:logitnorm_extra_results}. The conclusions for the baselines included in these tables are the same as for those discussed in the results section.


\begin{table*}[ht]
\centering
\setlength{\tabcolsep}{7.3pt}
\caption{
AUC of OOD detection with and without CEA using tabular ResNet and Transformer as the prediction model. We use eICU and Diabetics as ID and synthesize the OOD data by scaling factor $\alpha$. Superior results are emphasized in bold unless the two are equal. This table is similar to Table \ref{tab:artitechtures}, but includes different baseline detection methods.}
\begin{tabular}{ccccc|ccc}
\toprule
\multicolumn{1}{l}{}                            &                            & \multicolumn{3}{c|}{ResNet}                                                  & \multicolumn{3}{c}{Transformer}                                             \\ \cmidrule{3-8} 
\multicolumn{1}{l}{}                            & \multicolumn{1}{c}{}       & \multicolumn{1}{c}{$\alpha=10$} & \multicolumn{1}{c}{$\alpha=100$} & \multicolumn{1}{c|}{$\alpha=1000$} & \multicolumn{1}{c}{$\alpha=10$} & \multicolumn{1}{c}{$\alpha=100$} & \multicolumn{1}{c}{$\alpha=1000$} \\ \cmidrule{3-8} 
ID                     & Method                     & \multicolumn{6}{c}{Baseline / Baseline\hskip0.75pt\&\hskip0.75pt CEA}                                                                                                                          \\ \midrule
\multicolumn{1}{l|}{\multirow{9}{*}{eICU} }                           & \multicolumn{1}{l|}{RMDS}
     & 52.6 / \textbf{66.4} & 64.7 / \textbf{85.5} & 79.4 / \textbf{93.4} & 52.2 / 52.4 & 60.8 / \textbf{61.0} & 72.9 / \textbf{73.2} \\
     \multicolumn{1}{l|}{}                           & \multicolumn{1}{l|}{SHE}
                                 & 72.9 / 72.9 & \textbf{89.9} / 89.8 & 93.8 / \textbf{93.9} & 57.7 / 57.7 & 73.2 / 73.2 & 81.5 / \textbf{81.6} \\
       \multicolumn{1}{l|}{}                           & \multicolumn{1}{l|}{KLM}                               
                                  & 58.7 / \textbf{66.8} & 73.0 / \textbf{85.6} & 82.9 / \textbf{93.3} & 56.0 / \textbf{56.2} & 65.1 / \textbf{66.1} & 72.8 / \textbf{73.2} \\
           \multicolumn{1}{l|}{}                           & \multicolumn{1}{l|}{OpenMax}   
                         & 58.6 / \textbf{69.2} & 69.6 / \textbf{87.1} & 79.6 / \textbf{93.6} & 51.1 / \textbf{51.7} & 54.2 / \textbf{56.1} & 71.2 / \textbf{72.6} \\
                       \multicolumn{1}{l|}{}                           & \multicolumn{1}{l|}{MLS}
                               & 46.5 / \textbf{69.2} & 28.8 / \textbf{87.1} & 13.3 / \textbf{93.6} & 51.7 / \textbf{52.4} & 56.3 / \textbf{58.0} & 71.8 / \textbf{73.4} \\
                               \multicolumn{1}{l|}{}                           & \multicolumn{1}{l|}{TempScale}
                           & 47.9 / \textbf{69.5} & 30.6 / \textbf{87.3} & 13.2 / \textbf{93.6} & 51.7 / \textbf{52.5} & 56.1 / \textbf{58.3} & 71.7 / \textbf{73.5} \\
                            \multicolumn{1}{l|}{}                           & \multicolumn{1}{l|}{GradNorm}
                           & 37.1 / \textbf{66.5} & 17.3 / \textbf{84.9} & 7.2 / \textbf{93.4}  & 53.5 / \textbf{54.1} & 63.1 / \textbf{64.2} & 76.3 / \textbf{77.1} \\
                           \multicolumn{1}{l|}{}                           & \multicolumn{1}{l|}{DICE}
                              & 42.2 / \textbf{67.0} & 24.3 / \textbf{85.6} & 10.0 / \textbf{93.4} & 53.3 / \textbf{53.8} & 62.8 / \textbf{63.6} & 76.0 / \textbf{76.8} \\
                           \multicolumn{1}{l|}{}                           & \multicolumn{1}{l|}{ASH}
                               & 47.2 / \textbf{69.5} & 30.8 / \textbf{87.2} & 14.4 / \textbf{93.5} & 51.7 / \textbf{52.4} & 56.0 / \textbf{57.8} & 70.5 / \textbf{72.1} \\ \midrule
                           \multicolumn{1}{l|}{\multirow{9}{*}{Diabetics}}                           & \multicolumn{1}{l|}{RMDS}
      & 74.8 / \textbf{77.4} & 85.8 / \textbf{87.5} & 90.1 / \textbf{91.0} & 65.3 / \textbf{65.4} & 80.2 / \textbf{80.3} & 86.3 / \textbf{86.3} \\
      \multicolumn{1}{l|}{}                           & \multicolumn{1}{l|}{SHE}
                                  & 81.9 / 81.9 & 88.7 / 88.7 & 91.5 / 91.5 & 69.2 / \textbf{69.3} & 85.3 / 85.3 & 90.0 / 90.0 \\
    \multicolumn{1}{l|}{}                           & \multicolumn{1}{l|}{KLM}
                               & 74.0 / \textbf{77.2} & 84.9 / \textbf{87.0} & 89.8 / \textbf{90.4} & 55.9 / \textbf{56.2} & 57.8 / \textbf{57.9} & 56.4 / \textbf{57.3} \\
    \multicolumn{1}{l|}{}                           & \multicolumn{1}{l|}{OpenMax}
                              & 55.8 / \textbf{73.7} & 66.2 / \textbf{86.6} & 68.8 / \textbf{90.2} & 42.7 / \textbf{43.9} & 50.0 / \textbf{50.6} & 64.2 / \textbf{65.6} \\
    \multicolumn{1}{l|}{}                           & \multicolumn{1}{l|}{MLS}
                                 & 33.6 / \textbf{67.8} & 22.7 / \textbf{86.2} & 18.4 / \textbf{90.2} & 41.0 / \textbf{41.9} & 50.3 / \textbf{50.7} & 66.3 / \textbf{67.4} \\
    \multicolumn{1}{l|}{}                           & \multicolumn{1}{l|}{TempScale}
                           & 25.6 / \textbf{67.4} & 15.5 / \textbf{86.3} & 10.5 / \textbf{90.3} & 38.4 / \textbf{39.9} & 47.3 / \textbf{48.2} & 61.8 / \textbf{62.7} \\
    \multicolumn{1}{l|}{}                           & \multicolumn{1}{l|}{GradNorm}
                           & 21.4 / \textbf{65.2} & 13.2 / \textbf{85.7} & 9.7 / \textbf{90.0}  & 38.7 / \textbf{39.7} & 47.6 / \textbf{48.2} & 62.2 / \textbf{63.0} \\
    \multicolumn{1}{l|}{}                           & \multicolumn{1}{l|}{DICE}
                                 & 22.5 / \textbf{66.1} & 12.9 / \textbf{85.4} & 9.5 / \textbf{89.8}  & 63.1 / \textbf{64.0} & 82.6 / \textbf{82.8} & 88.9 / \textbf{89.0} \\
    \multicolumn{1}{l|}{}                           & \multicolumn{1}{l|}{ASH}
                                  & 35.0 / \textbf{68.9} & 23.3 / \textbf{86.1} & 17.8 / \textbf{90.2} & 43.1 / \textbf{44.0} & 51.5 / \textbf{52.0} & 65.2 / \textbf{65.6}
               \\ \bottomrule
\end{tabular}
\label{tab:artitechtures_extra_results}
\end{table*}

\begin{table*}[ht]
\centering
\setlength{\tabcolsep}{4.0pt}
%\vskip 5 pt
\caption{
AUC of OOD detection with and without CEA on the model trained with the LogitNorm loss.
We use eICU and Diabetics as ID and synthesize the OOD data by scaling factor $\alpha$. Superior results are emphasized in bold unless the two are equal. This table is similar to Table \ref{tab:logitnorm}, but includes different baseline detection methods.}
\resizebox{\columnwidth}{!}{
\begin{tabular}{ccccccccccc}
\toprule
\multirow{2}{*}{ID}        & \multirow{2}{*}{$\alpha$} & RMDS      & SHE       & KLM       & OpenMax   & MLS       & TempScale & \multicolumn{1}{c}{GradNorm} & \multicolumn{1}{c}{DICE} & \multicolumn{1}{c}{ASH} \\ \cmidrule{3-11} 
                           &                                        & \multicolumn{9}{c}{Baseline / Baseline\hskip0.75pt\&\hskip0.75pt CEA}                                                                          \\ \midrule
\multirow{3}{*}{eICU}      & 10                                     & 61.3 / \textbf{64.9} & 65.3 / 65.3 & 51.5 / \textbf{63.9} & 52.6 / \textbf{65.7} & 55.1 / \textbf{67.3} & 55.1 / \textbf{67.3} & 33.9 / \textbf{60.5}                    & 38.6 / \textbf{63.0}                & 57.3 / \textbf{65.4}               \\
                           & 100                                    & 74.8 / \textbf{80.1} & 80.9 / 80.9 & 52.0 / \textbf{78.6} & 53.6 / \textbf{80.0} & 61.8 / \textbf{82.7} & 61.8 / \textbf{82.7} & 18.9 / \textbf{76.1}                    & 25.3 / \textbf{77.8}               & 62.4 / \textbf{80.1}               \\
                           & 1000                                   & 85.2 / \textbf{89.1} & 89.9 / \textbf{90.0} & 52.4 / \textbf{89.0} & 54.3 / \textbf{89.2} & 64.6 / \textbf{90.0} & 64.6 / \textbf{90.0} & 10.3 / \textbf{87.9}                    & 14.0 / \textbf{88.3}                & 63.2 / \textbf{89.3}               \\ \midrule
\multirow{3}{*}{Diabetics} & 10                                     & 68.9 / \textbf{73.3} & 78.8 / 78.8 & 56.7 / \textbf{73.8} & 60.8 / \textbf{74.9} & 35.2 / \textbf{65.1} & 35.2 / \textbf{65.1} & 21.4 / \textbf{60.9}                    & 19.1 / \textbf{60.1}                & 19.2 / \textbf{60.2}               \\
                           & 100                                    & 84.2 / \textbf{86.8} & 88.4 / 88.4 & 59.3 / \textbf{86.0} & 63.8 / \textbf{86.7} & 32.6 / \textbf{85.7} & 32.6 / \textbf{85.7} & 12.6 / \textbf{84.6}                    & 11.5 / \textbf{83.9}                & 11.6 / \textbf{84.1}               \\
                           & 1000                                   & 89.7 / \textbf{90.8} & 91.5 / 91.5 & 62.1 / \textbf{90.0} & 65.6 / \textbf{90.6} & 31.8 / \textbf{90.1} & 31.8 / \textbf{90.1} & 9.5 / \textbf{89.6}                     & 8.7 / \textbf{89.0}                 & 8.8 / \textbf{89.1}         \\ \bottomrule
\end{tabular}
}
\label{tab:logitnorm_extra_results}
\end{table*}

\subsection{Extension to Images}
Results for the CIFAR-100 dataset are displayed in Table \ref{tab:images_apd}. According to this table, when CIFAR-100 is the ID set, our method improves results significantly on synthesized OOD sets, but not on real-world OOD sets. For instance, results on real-world OOD sets are the same with and without our method for the ResNet-32 model. 
This table also provides results for detection methods not included in the main text for the MNIST and CIFAR-10 datasets due to page limits, which follow the same trend as those in the main text.

\iffalse
\begin{table*}[t]
\centering
\setlength{\tabcolsep}{7.3pt}
\caption{OOD detection performance with and without CEA using CIFAR-100 as ID, and OOD sets are synthesized by i) scaling or ii) an adversarial attack, or iii) selected from other datasets. Also, ResNet-32 and ReLU MLP classifiers are used as the prediction model. Superior results are emphasized in bold unless the two are equal.}
\begin{tabular}{ccccc|ccc}
\toprule
\multicolumn{1}{l}{}                           &                            & \multicolumn{3}{c|}{ReLU MLP}                                                      & \multicolumn{3}{c}{ResNet-32}                                                     \\ \cmidrule{3-8} 
\multicolumn{1}{l}{}                           & \multicolumn{1}{c}{}       & \multicolumn{1}{c}{Scale} & \multicolumn{1}{c}{Attack} & \multicolumn{1}{c|}{Real} & \multicolumn{1}{c}{Scale} & \multicolumn{1}{c}{Attack} & \multicolumn{1}{c}{Real} \\ \cmidrule{3-8} 
ID                   & Method                     & \multicolumn{6}{c}{Baseline / Baseline\hskip0.75pt\&\hskip0.75pt CEA}                                                                                                                                      \\ \midrule
\multicolumn{1}{l|}{\multirow{7}{*}{CIFAR-100}} & \multicolumn{1}{l|}{MDS}   & 96.7 / \textbf{97.0}              & \textbf{95.6} / 94.4               & 57.5 / 57.7                 & 98.7 / 98.7              & 99.9 / 99.9               & 53.1 / 53.1                \\
\multicolumn{1}{c|}{}                           & \multicolumn{1}{l|}{KNN}   & 76.5 / \textbf{96.3}              & 56.5 / \textbf{86.2}               & 63.4 / \textbf{66.3}                 & 99.1 / 99.1              & 64.8 / \textbf{97.3}               & 76.0 / 76.0                \\
\multicolumn{1}{c|}{}                           & \multicolumn{1}{l|}{ViM}   & 65.2 / \textbf{96.1}              & 72.3 / \textbf{79.2}               & 60.0 / \textbf{61.1}                 & 5.6 / \textbf{95.2}               & 0.1 / \textbf{83.7}                & 82.1 / 82.1                \\
\multicolumn{1}{c|}{}                           & \multicolumn{1}{l|}{MSP}   & 8.0 / \textbf{84.7}               & 11.1 / \textbf{31.7}               & 49.7 / \textbf{50.9}                 & 29.0 / \textbf{97.6}              & 2.0 / \textbf{91.5}                & 75.0 / \textbf{75.1}                \\
\multicolumn{1}{c|}{}                           & \multicolumn{1}{l|}{EBO}   & 55.0 / \textbf{94.7}              & 64.6 / \textbf{75.6}               & 60.2 / \textbf{61.6}                 & 5.6 / \textbf{95.2}               & 0.1 / \textbf{83.3}                & 82.1 / 82.1                \\
\multicolumn{1}{c|}{}                           & \multicolumn{1}{l|}{ReAct} & 89.0 / \textbf{95.4}              & 86.7 / \textbf{89.8}               & 59.7 / \textbf{61.7}                 & 93.2 / \textbf{98.4}              & 12.8 / \textbf{91.7}               & 77.7 / 77.7                \\
\multicolumn{1}{c|}{}                           & \multicolumn{1}{l|}{Gram}  & 4.9 / \textbf{11.4}               & 9.4 / \textbf{19.5}                & 54.4 / \textbf{55.5}                 & 1.2 / \textbf{79.6}               & 0.0 / \textbf{62.9}                & 72.3 / 72.3               
    
    \\ \bottomrule           
\end{tabular}
\label{tab:images_apd}
\end{table*}
\fi

\begin{table*}[ht]
\centering
%\renewcommand{\arraystretch}{0.9}
\setlength{\tabcolsep}{7.3pt}
\caption{AUC of OOD detection with and without CEA on image datasets. MNIST, CIFAR-10, and CIFAR-100 serve as ID sets, and OOD sets are i) synthesized by scaling, ii) synthesized by an adversarial attack, or iii) selected from other datasets. ResNet-32 and ReLU MLP classifiers are used as the prediction models. The better result in each pair is in bold unless the two are equal. This table is similar to Table~\ref{tab:images}, but adds CIFAR-100 as an ID dataset and uses different baseline detection methods.}
\begin{tabular}{llcccccc}
\toprule
                                                 &                                & \multicolumn{3}{c|}{ReLU MLP}                                & \multicolumn{3}{c}{ResNet-32}           \\ \cmidrule{3-8} 
                                                 & \multicolumn{1}{c}{}           & Scale       & Attack      & \multicolumn{1}{c|}{Other}       & Scale       & Attack      & Other       \\ \cmidrule{3-8} 
\multicolumn{1}{c}{ID}                           & \multicolumn{1}{c}{Method}     & \multicolumn{6}{c}{Baseline / Baseline\hskip0.75pt\&\hskip0.75pt CEA}                                                                          \\ \midrule
\multicolumn{1}{l|}{\multirow{9}{*}{MNIST}}      & \multicolumn{1}{l|}{RMDS}      & 62.6 / \textbf{62.7} & 83.6 / \textbf{84.1} & \multicolumn{1}{c|}{96.7 / \textbf{96.9}} & 68.0 / 68.0 & 99.7 / 99.7 & 99.6 / 99.6 \\
\multicolumn{1}{l|}{}                            & \multicolumn{1}{l|}{SHE}       & 63.3 / 63.3 & 92.1 / \textbf{93.2} & \multicolumn{1}{c|}{\textbf{86.3} / 85.9} & 61.0 / 61.0 & 99.9 / 99.9 & 99.7 / 99.7 \\
\multicolumn{1}{l|}{}                            & \multicolumn{1}{l|}{KLM}       & 57.9 / \textbf{64.9} & 69.5 / \textbf{80.8} & \multicolumn{1}{c|}{79.5 / \textbf{85.8}} & 56.5 / \textbf{59.2} & 45.3 / \textbf{98.7} & 85.5 / \textbf{96.6} \\
\multicolumn{1}{l|}{}                            & \multicolumn{1}{l|}{OpenMax}   & 62.7 / \textbf{62.8} & 85.8 / \textbf{88.5} & \multicolumn{1}{c|}{90.4 / \textbf{92.0}} & 57.2 / \textbf{57.4} & 99.2 / \textbf{99.6} & 99.4 / \textbf{99.5} \\
\multicolumn{1}{l|}{}                            & \multicolumn{1}{l|}{MLS}       & 45.9 / \textbf{56.5} & 7.4 / \textbf{42.5}  & \multicolumn{1}{c|}{81.2 / \textbf{92.7}} & 50.6 / \textbf{56.5} & 2.7 / \textbf{98.8}  & 76.5 / \textbf{99.0} \\
\multicolumn{1}{l|}{}                            & \multicolumn{1}{l|}{TempScale} & 49.8 / \textbf{50.8} & 15.3 / \textbf{65.6} & \multicolumn{1}{c|}{80.3 / \textbf{89.2}} & 53.4 / \textbf{55.9} & 6.0 / \textbf{99.4}  & 94.8 / \textbf{98.4} \\
\multicolumn{1}{l|}{}                            & \multicolumn{1}{l|}{GradNorm}  & 43.4 / \textbf{57.2} & 5.9 / \textbf{41.3}  & \multicolumn{1}{c|}{39.0 / \textbf{61.6}} & 37.4 / \textbf{64.4} & 1.3 / \textbf{98.9}  & 62.8 / \textbf{98.3} \\
\multicolumn{1}{l|}{}                            & \multicolumn{1}{l|}{DICE}      & 42.1 / \textbf{59.1} & 5.9 / \textbf{43.1}  & \multicolumn{1}{c|}{59.3 / \textbf{81.9}} & 38.0 / \textbf{63.9} & 3.2 / \textbf{99.5}  & 57.5 / \textbf{95.1} \\
\multicolumn{1}{l|}{}                            & \multicolumn{1}{l|}{ASH}       & 42.2 / \textbf{60.1} & 7.5 / \textbf{42.5}  & \multicolumn{1}{c|}{81.2 / \textbf{92.7}} & 51.4 / \textbf{55.3} & 2.7 / \textbf{98.8}  & 76.1 / \textbf{98.9} \\ \midrule
\multicolumn{1}{l|}{\multirow{9}{*}{CIFAR-10}}   & \multicolumn{1}{l|}{RMDS}      & 89.8 / \textbf{95.4} & 53.6 / \textbf{68.7} & \multicolumn{1}{c|}{58.5 / \textbf{61.3}} & 98.0 / 98.0 & 99.9 / 99.9 & 87.4 / 87.4 \\
\multicolumn{1}{l|}{}                            & \multicolumn{1}{l|}{SHE}       & 98.0 / 98.0 & \textbf{95.1} / 94.9 & \multicolumn{1}{c|}{\textbf{60.6} / 60.4} & 96.8 / 96.8 & 99.9 / 99.9 & 85.8 / 85.8 \\
\multicolumn{1}{l|}{}                            & \multicolumn{1}{l|}{KLM}       & 88.8 / \textbf{96.1} & 90.9 / \textbf{93.9} & \multicolumn{1}{c|}{57.7 / \textbf{59.0}} & 78.9 / \textbf{94.1} & 68.4 / \textbf{92.2} & 80.5 / 80.5 \\
\multicolumn{1}{l|}{}                            & \multicolumn{1}{l|}{OpenMax}   & 87.0 / \textbf{95.9} & 75.1 / \textbf{85.3} & \multicolumn{1}{c|}{71.4 / \textbf{71.7}} & 96.2 / \textbf{96.7} & 99.9 / 99.9 & 86.7 / \textbf{87.1} \\
\multicolumn{1}{l|}{}                            & \multicolumn{1}{l|}{MLS}       & 20.2 / \textbf{94.0} & 8.4 / \textbf{35.5}  & \multicolumn{1}{c|}{67.7 / \textbf{72.5}} & 64.3 / \textbf{97.6} & 0.0 / \textbf{83.8}  & 90.1 / \textbf{90.2} \\
\multicolumn{1}{l|}{}                            & \multicolumn{1}{l|}{TempScale} & 19.3 / \textbf{92.2} & 9.6 / \textbf{53.9}  & \multicolumn{1}{c|}{58.2 / \textbf{64.2}} & 93.1 / \textbf{98.5} & 0.0 / \textbf{99.6}  & 88.3 / \textbf{88.9} \\
\multicolumn{1}{l|}{}                            & \multicolumn{1}{l|}{GradNorm}  & 3.7 / \textbf{86.9}  & 4.0 / \textbf{20.9}  & \multicolumn{1}{c|}{49.4 / \textbf{55.4}} & 11.9 / \textbf{71.5} & 0.0 / \textbf{91.9}  & 66.3 / \textbf{67.3} \\
\multicolumn{1}{l|}{}                            & \multicolumn{1}{l|}{DICE}      & 4.8 / \textbf{92.5}  & 4.8 / \textbf{55.1}  & \multicolumn{1}{c|}{58.9 / \textbf{68.0}} & 52.4 / \textbf{93.1} & 0.0 / \textbf{88.6}  & 90.1 / 90.1 \\
\multicolumn{1}{l|}{}                            & \multicolumn{1}{l|}{ASH}       & 22.5 / \textbf{95.1} & 9.1 / \textbf{50.3}  & \multicolumn{1}{c|}{68.5 / \textbf{72.6}} & 68.7 / \textbf{97.1} & 0.0 / \textbf{84.4}  & 90.3 / 90.3 \\ \midrule
\multicolumn{1}{l|}{\multirow{16}{*}{CIFAR-100}} & \multicolumn{1}{l|}{MDS}       & 96.7 / \textbf{97.0} & \textbf{94.5} / 94.4 & \multicolumn{1}{c|}{57.7 / 57.7} & 98.7 / 98.7 & 99.9 / 99.9 & 53.1 / 53.1 \\
\multicolumn{1}{l|}{}                            & \multicolumn{1}{l|}{KNN}       & 76.5 / \textbf{96.3} & 56.5 / \textbf{86.2} & \multicolumn{1}{c|}{63.4 / \textbf{66.3}} & 99.1 / 99.1 & 64.8 / \textbf{97.3} & 76.0 / 76.0 \\
\multicolumn{1}{l|}{}                            & \multicolumn{1}{l|}{ViM}       & 65.2 / \textbf{96.1} & 72.3 / \textbf{79.2} & \multicolumn{1}{c|}{60.0 / \textbf{61.6}} & 5.6 / \textbf{95.2}  & 0.1 / \textbf{83.7}  & 82.1 / 82.1 \\
\multicolumn{1}{l|}{}                            & \multicolumn{1}{l|}{MSP}       & 8.0 / \textbf{84.7}  & 11.1 / \textbf{31.7} & \multicolumn{1}{c|}{49.7 / \textbf{50.9}} & 29.0 / \textbf{97.6} & 2.0 / \textbf{91.5}  & 75.0 / \textbf{75.1} \\
\multicolumn{1}{l|}{}                            & \multicolumn{1}{l|}{EBO}       & 55.0 / \textbf{94.7} & 64.6 / \textbf{75.6} & \multicolumn{1}{c|}{60.2 / \textbf{61.6}} & 5.6 / \textbf{95.2}  & 0.1 / \textbf{83.3}  & 82.1 / 82.1 \\
\multicolumn{1}{l|}{}                            & \multicolumn{1}{l|}{ReAct}     & 89.0 / \textbf{95.4} & 86.7 / \textbf{89.8} & \multicolumn{1}{c|}{59.7 / \textbf{61.7}} & 93.2 / \textbf{98.4} & 12.8 / \textbf{91.7} & 77.7 / 77.7 \\
\multicolumn{1}{l|}{}                            & \multicolumn{1}{l|}{Gram}      & 4.9 / \textbf{11.4}  & 9.4 / \textbf{19.5}  & \multicolumn{1}{c|}{54.4 / \textbf{55.5}} & 1.2 / \textbf{79.6}  & 0.0 / \textbf{62.9}  & 72.3 / 72.3 \\
\multicolumn{1}{l|}{}                            & \multicolumn{1}{l|}{RMDS}      & 89.0 / \textbf{90.5} & 45.0 / \textbf{52.6} & \multicolumn{1}{c|}{53.4 / \textbf{53.5}} & 99.2 / 99.2 & 99.8 / \textbf{99.9} & 71.7 / \textbf{71.8} \\
\multicolumn{1}{l|}{}                            & \multicolumn{1}{l|}{SHE}       & \textbf{97.6} / 97.4 & 94.3 / \textbf{94.5} & \multicolumn{1}{c|}{58.0 / \textbf{58.6}} & 98.5 / 98.5 & 99.9 / 99.9 & \textbf{57.1} / 56.9 \\
\multicolumn{1}{l|}{}                            & \multicolumn{1}{l|}{KLM}       & 93.8 / \textbf{95.5} & 84.5 / \textbf{93.3} & \multicolumn{1}{c|}{59.5 / \textbf{60.4}} & 55.4 / \textbf{98.1} & 42.6 / \textbf{98.4} & 75.5 / 75.5 \\
\multicolumn{1}{l|}{}                            & \multicolumn{1}{l|}{OpenMax}   & 78.5 / \textbf{82.3} & 79.9 / \textbf{86.0} & \multicolumn{1}{c|}{55.1 / \textbf{55.9}} & 62.7 / \textbf{84.7} & 60.3 / \textbf{93.5} & 70.0 / \textbf{70.1} \\
\multicolumn{1}{l|}{}                            & \multicolumn{1}{l|}{MLS}       & 46.9 / \textbf{67.7} & 36.0 / \textbf{49.2} & \multicolumn{1}{c|}{64.8 / \textbf{64.9}} & 4.4 / \textbf{97.9}  & 0.0 / \textbf{98.7}  & 81.0 / \textbf{81.5} \\
\multicolumn{1}{l|}{}                            & \multicolumn{1}{l|}{TempScale} & 8.2 / \textbf{75.8}  & 9.2 / \textbf{31.1}  & \multicolumn{1}{c|}{48.5 / \textbf{49.0}} & 34.1 / \textbf{98.1} & 1.7 / \textbf{99.6}  & 75.5 / \textbf{75.8} \\
\multicolumn{1}{l|}{}                            & \multicolumn{1}{l|}{GradNorm}  & 3.4 / \textbf{78.3}  & 5.2 / \textbf{14.4}  & \multicolumn{1}{c|}{46.7 / \textbf{47.6}} & 1.5 / \textbf{96.7}  & 0.0 / \textbf{99.5}  & 80.9 / 80.9 \\
\multicolumn{1}{l|}{}                            & \multicolumn{1}{l|}{DICE}      & 5.0 / \textbf{87.6}  & 5.6 / \textbf{57.5}  & \multicolumn{1}{c|}{51.8 / \textbf{55.5}} & 3.8 / \textbf{96.7}  & 0.1 / \textbf{99.4}  & 82.5 / 82.5 \\
\multicolumn{1}{l|}{}                            & \multicolumn{1}{l|}{ASH}       & 56.8 / \textbf{93.6} & 55.4 / \textbf{76.8} & \multicolumn{1}{c|}{62.1 / \textbf{62.8}} & 2.9 / \textbf{98.4}  & 0.0 / \textbf{98.9}  & 82.1 / 82.1 \\ \bottomrule
\end{tabular}
\label{tab:images_apd}
\end{table*}

\section{Model Calibration}\label{apd:calibration}

Overconfidence on OOD data and calibration on ID data are distinct properties: whether a model is calibrated on ID data need not determine how its confidence behaves on OOD points, which by definition come from a different distribution. Moreover, theoretical results show that some architectures are always overconfident on OOD inputs, regardless of how well they are calibrated on ID data \citep{hein2019relu, ulmer2021know}.

That said, there may be architectures for which improved calibration enhances OOD detection, but to our knowledge this has not been established in general. To assess the impact of calibration on our results, we include temperature scaling \citep{guo2017calibration} combined with MSP among our baselines (denoted TempScale). Table~\ref{tab:calibration} reports the expected calibration error (ECE) with and without temperature scaling for the ResNet model trained on our datasets, which quantifies how much temperature scaling improves ID calibration. The OOD detection results reported earlier indicate that temperature scaling improves the AUC of OOD detection over MSP only marginally: by at most 1\% across all experiments (e.g., see Figures~\ref{fig:tabular} and \ref{fig:mimic_vs_eicu}).
This suggests that ID calibration may offer a small benefit for OOD detection, although the evidence is insufficient to support a general claim.
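
For reference, the two quantities involved can be written out explicitly. The following is a sketch in the standard formulation of \citet{guo2017calibration}; the symbols $z(x)$, $T$, $B_m$, and $n$ are introduced here for illustration only. Temperature scaling divides the logits $z(x)$ by a single scalar $T>0$, which in the standard formulation is fitted on held-out ID validation data, so the TempScale score is
\begin{equation*}
s_{\mathrm{TempScale}}(x) = \max_{k} \, \mathrm{softmax}\bigl(z(x)/T\bigr)_k ,
\end{equation*}
which reduces to MSP for $T=1$. The ECE partitions the $n$ evaluation samples into $M$ equal-width confidence bins $B_1,\dots,B_M$ and takes the weighted average of the gap between accuracy and mean confidence in each bin:
\begin{equation*}
\mathrm{ECE} = \sum_{m=1}^{M} \frac{|B_m|}{n} \, \bigl| \mathrm{acc}(B_m) - \mathrm{conf}(B_m) \bigr| .
\end{equation*}
Because $T$ is a single positive scalar, it rescales the confidence without changing the predicted class.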

\begin{table}[ht]
\centering
\caption{The ECE (\%) using $M = 15$ bins, with and without temperature scaling, for the ResNet model trained on our datasets.}
\begin{tabular}{cccccc}
\toprule
Temp Scaling & MNIST & CIFAR-10 & CIFAR-100 & eICU & MIMIC-IV \\ \midrule
\ding{55}    & 0.38  & 4.47     & 13.33     & 1.81 & 5.43     \\
\ding{51}    & 0.31  & 1.28     & 4.39      & 1.17 & 2.63     \\ \bottomrule
\end{tabular}
\label{tab:calibration}
\end{table}

\section{All layers instead of penultimate layer}

In the proposed method, we use the activation values at the penultimate layer of the neural network. Here, we examine the impact of using all intermediate layers rather than just one. To this end, we repeat the experiment from Section 4.1 with $\alpha=1000$, now considering all the layers. The same algorithm that was applied to the penultimate layer is applied to every layer, and the per-layer outputs (each normalized by the number of nodes in its layer) are summed.
%The plot comparing the results with only one or all layers are displayed in Fig. \ref{fig:method_with_all_layers} for eICU and Diabetics datasets. 
Fig.~\ref{fig:method_with_all_layers} compares using only the penultimate layer with using all layers on the eICU and Diabetics datasets.
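
As a sketch of this aggregation, let $s_i(x)$ denote the output of the algorithm at intermediate layer $i$ and $n_i$ the number of nodes in that layer, for $i = 1,\dots,L$ (this notation is assumed here for illustration and may differ from the main text). The all-layer quantity is then
\begin{equation*}
s_{\mathrm{all}}(x) = \sum_{i=1}^{L} \frac{s_i(x)}{n_i},
\end{equation*}
whereas the default setting uses only the term from the penultimate layer. Dividing by $n_i$ keeps wide layers from dominating the sum merely because they contain more nodes.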

According to this figure, both settings improve OOD detection performance. On the eICU dataset, every detection method performs better with only the penultimate layer, whereas on the Diabetics dataset many baselines perform better when all layers are used.
The best option therefore depends on the dataset and the detection method, although averaged over these two datasets the penultimate layer alone performs better.

\begin{figure}[t]
    \centering
        \includegraphics[scale=0.8]{images/method_with_all_layers.png}
        \caption{OOD detection performance when capturing extreme activations in only the penultimate layer (y-axis) versus in all intermediate layers (x-axis). The eICU and Diabetics datasets serve as ID, and the OOD set is generated using $\alpha=1000$.}
    \label{fig:method_with_all_layers}
\end{figure}

\section{Other norms in CEA}
CEA measures the $\ell_2$ norm of activations exceeding a specified threshold. As stated in the main text, the $\ell_2$ norm can in principle be replaced with other $\ell_p$ norms. Here, we evaluate how using the $\ell_0$ and $\ell_1$ norms affects the outcomes. According to Table~\ref{tab:lp_norms}, the AUC with these norms differs from that with the $\ell_2$ norm by at most 0.2 points, which indicates that CEA can be used with other reasonable norms as well.
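
To make the comparison concrete, one way to write the three variants is as follows. The symbols $a(x)$, $\tau$, and $e(x)$ are notation assumed here for illustration, and the exact definition of CEA is the one given in the main text. Let $a(x)\in\mathbb{R}^d$ be the activation vector and $\tau_j \ge 0$ the threshold for node $j$ (possibly shared across nodes), and collect the activations that exceed their thresholds in
\begin{equation*}
e_j(x) = a_j(x)\,\mathbf{1}\bigl[a_j(x) > \tau_j\bigr], \qquad j = 1,\dots,d .
\end{equation*}
Under this reading, the quantities compared in Table~\ref{tab:lp_norms} are
\begin{align*}
\|e(x)\|_0 &= \bigl|\{\, j : a_j(x) > \tau_j \,\}\bigr|, \\
\|e(x)\|_1 &= \sum_{j=1}^{d} |e_j(x)|, \\
\|e(x)\|_2 &= \Bigl(\sum_{j=1}^{d} e_j(x)^2\Bigr)^{1/2}.
\end{align*}
The $\ell_0$ variant only counts how many nodes exceed their threshold, while $\ell_1$ and $\ell_2$ also take the magnitude of those activations into account.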

\begin{table}[ht]
\centering
\caption{AUC of OOD detection with CEA using the $\ell_0$, $\ell_1$, or $\ell_2$ norm to measure the size of extreme activations. The ID datasets are eICU and Diabetics, OOD sets are synthesized with $\alpha=1000$, and the baseline detection methods are MSP and EBO.}
\begin{tabular}{ccccccccc}
\toprule
\multirow{2}{*}{Method} & &\multicolumn{3}{c}{eICU} & & \multicolumn{3}{c}{Diabetics} \\
                      &  & $\ell_0$     & $\ell_1$     & $\ell_2$    & & $\ell_0$     & $\ell_1$     & $\ell_2$      \\ \midrule
MSP                 & &  88.2 &     88.3   &   88.4       &      &    88.1      &      88.2    &    88.2     \\
EBO                &  &  88.4 &   88.4     &   88.4     &        &       88.2   &    88.2      &    88.3    \\ \bottomrule
\end{tabular}
\label{tab:lp_norms}
\end{table}

\end{document}
