% \documentclass{uai2024} % for initial submission
\documentclass[accepted]{uai2024} % after acceptance, for a revised version; 
% also before submission to see how the non-anonymous paper would look like 
                        
%% There is a class option to choose the math font
% \documentclass[mathfont=ptmx]{uai2024} % ptmx math instead of Computer
                                         % Modern (has noticeable issues)
% \documentclass[mathfont=newtx]{uai2024} % newtx fonts (improves upon
                                          % ptmx; less tested, no support)
% NOTE: Only keep *one* line above as appropriate, as it will be replaced
%       automatically for papers to be published. Do not make any other
%       change above this note for an accepted version.

%% Choose your variant of English; be consistent
\usepackage[american]{babel}
% \usepackage[british]{babel}

%% Some suggested packages, as needed:
\usepackage{natbib} % has a nice set of citation styles and commands
    \bibliographystyle{plainnat}
    \renewcommand{\bibsection}{\subsubsection*{References}}
\usepackage{mathtools} % amsmath with fixes and additions
% \usepackage{siunitx} % for proper typesetting of numbers and units
\usepackage{booktabs} % commands to create good-looking tables
\usepackage{tikz} % nice language for creating drawings and diagrams

\usepackage{graphicx}
\usepackage{dirtytalk}
\usepackage{multirow} %% to span text across several rows
\usepackage{multicol} %% to set text in several columns
\usepackage{tabularx}
\usepackage{algorithm}
\usepackage{algpseudocode}
\usepackage{pifont}
\usepackage{soul}
\usepackage{cleveref}
\usepackage{caption}
\usepackage{subcaption}
\usepackage{xcolor}
% \usepackage[caption=false,font=footnotesize]{subfig}
\usepackage[figuresright]{rotating}
\usepackage{wrapfig}

% \usepackage[toc,page,titletoc]{appendix}
\usepackage{appendix}
% \usepackage{lscape}
% \crefname{supp}{Supplement}{Supplements}


%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%                 Authors MACROS Commands               %
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%% Provided macros
% \smaller: Because the class footnote size is essentially LaTeX's \small,
%           redefining \footnotesize, we provide the original \footnotesize
%           using this macro.
%           (Use only sparingly, e.g., in drawings, as it is quite small.)

%% Self-defined macros

\newcommand{\swap}[3][-]{#3#1#2} % just an example

\algdef{SE}[SUBALG]{Indent}{EndIndent}{}{\algorithmicend\ }%
\algtext*{Indent}
\algtext*{EndIndent}

\newcommand{\cmark}{\ding{51}}%
\newcommand{\xmark}{\ding{55}}%


\usepackage{xspace} % adds a trailing space only when no punctuation follows
\newcommand{\ie}{\textit{i}.\textit{e}.\xspace}
\newcommand{\eg}{\textit{e}.\textit{g}.\xspace}

% Please comment lines below before submission
% \newcommand{\abc}[1]{\textcolor{teal}{[Fabio: \em #1]}}
% \newcommand{\ab}[1]{\textcolor{violet}{#1}}
% \newcommand{\reviewer}[1]{\textcolor{red}{#1}}
% \newcommand{\fa}[1]{\textcolor{blue}{[Fabio: \em #1]}}
% \newcommand{\FA}[1]{{\color[rgb]{1.000000,0.509804,0.278431} [FA: #1]}}
\newcommand{\ToDo}[1]{\textcolor{red}{[ToDo: #1]}}
\newcommand{\FA}[1]{{\color{orange} [FA: #1]}}
\newcommand{\DM}[1]{{\color{blue} [DM: #1]}}
\newcommand{\tabitem}{~~\llap{\textbullet}~~}

% Optional math commands from https://github.com/goodfeli/dlbook_notation.
\input{math_commands.tex}



%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%						PAPER TITLE							%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\title{Latent Representation Entropy Density for Distribution Shift Detection}

% The standard author block has changed for UAI 2024 to provide
% more space for long author lists and allow for complex affiliations
%
% All author information is automatically removed by the class for the
% anonymous submission version of your paper, so you can already add your
% information below.
%
% Add authors
\author[1]{\href{mailto:<fabio.arnez@cea.fr>?Subject=LaREx UAI 2024 paper}{Fabio Arnez}{}}
\author[1]{Daniel Alfonso Montoya Vasquez}
\author[1]{Ansgar Radermacher}
\author[1]{Fran\c{c}ois Terrier}
% \author[1]{Further~Coauthor}
% \author[3]{Further~Coauthor}
% \author[3,1]{Further~Coauthor}
% Add affiliations after the authors
\affil[1]{%
    Universit\'e Paris-Saclay, CEA, List\\
    F-91120, Palaiseau, France
}
% \affil[2]{%
%     Second Affiliation\\
%     Address\\
%     …
% }
% \affil[3]{%
%     Another Affiliation\\
%     Address\\
%     …
%   }
  
\begin{document}
\maketitle

\begin{abstract}
Distribution shift detection is paramount in safety-critical tasks that rely on Deep Neural Networks (DNNs). The detection task entails deriving a confidence score to assert whether a new input sample aligns with the training data distribution of the DNN model. While DNN predictive uncertainty offers an intuitive confidence measure, exploring uncertainty-based distribution shift detection with simple sample-based techniques has been relatively overlooked in recent years due to computational overhead and lower performance than plain post-hoc methods. This paper proposes using simple sample-based techniques for estimating uncertainty and employing the entropy density from intermediate representations to detect distribution shifts. We demonstrate the effectiveness of our method using standard benchmark datasets for out-of-distribution detection and across different common perception tasks with convolutional neural network architectures. Our scope extends beyond classification, encompassing image-level distribution shift detection for object detection and semantic segmentation tasks. Our results show that our method's performance is comparable to existing \textit{State-of-the-Art} methods while being computationally faster and lighter than other Bayesian approaches, affirming its practical utility. Code is available at \url{https://github.com/CEA-LIST/LaREx}.
\end{abstract}



%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%							SECTION							%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Introduction}
\label{sec:introduction}

As highly automated systems increasingly rely on DNNs to perform safety-critical tasks, representing confidence in DNN predictions has become crucial for deployment in the open world. Trustworthy DNN models should provide accurate predictions and detect samples that differ from those observed in the training distribution. Therefore, capturing information about \textit{\say{what the model does not know}} is not only helpful but essential in safety-critical tasks and real-world deployment \citep{sun2021react}.


In image classification, multiple methods have been proposed for distribution shift detection by building DNN prediction confidence scores, among which post-hoc methods stand out mainly by their less-invasive nature and practical use~\citep{yang2021generalized,ruff2021unifying}. DNN predictive uncertainty offers a plain confidence representation. Existing Bayesian deep learning (BDL) methods provide a simple and principled approach to estimating DNN uncertainty. DNN predictive uncertainty with BDL methods has been used for detecting out-of-distribution (OoD) samples under the assumption that samples far away from the training distribution provide higher predictive uncertainty than samples observed in the training data~\citep{ovadia2019can,kendall2017uncertainties}.


While BDL sampling-based methods are conceptually straightforward (\eg, Monte-Carlo dropout), their practical implementation is hindered by substantial computational costs, limiting widespread adoption. Furthermore, recent works \citep{yang2021generalized,mukhoti2023deep} argue that BDL uncertainty is comparatively less effective for OoD detection when contrasted with more direct (deterministic) post-hoc methods.
In addition, these problems carry over to more complex computer vision tasks. In semantic segmentation, the lack of information on semantic structures and contexts yields mismatches between anomaly pixel masses and pixel uncertainty regions \citep{di2021pixel,xia2020synthesize}. In object detection, object distance and occlusion can impact the bounding-box predictive uncertainty for regression and classification~\citep{feng2021review,wang2020robust}. Therefore, the limitations mentioned above lead to the open question: \textit{Are DNN uncertainty-based confidence scores, with simple sample-based methods, still competitive for distribution shift detection?}


In this paper, we propose to use the uncertainty from intermediate latent representations (feature maps and embeddings) to detect distribution shifts at the image level. We leverage the latent representation entropy density from the training dataset and propose two new confidence scores (fully defined in  \Cref{sec:Repre_entropy_density}) that we call \texttt{LaRED} \& \texttt{LaREM} (\texttt{LaREx} for short). Our approach offers compelling benefits: 1) OoD data agnostic, \ie, the score threshold is estimated only with in-distribution (InD) data; 2) simple post-hoc method that requires a single noise layer; 3) reduced runtime compared to sample-based BDL techniques and comparable to deterministic counterpart methods; 4) the presented scores can be applied to different CNN-based model architectures from different tasks. The paper contributions are summarized below:



\begin{enumerate}
    \item We present two uncertainty-based confidence scores (\texttt{LaREx}) for image-level distribution shift detection that are computationally efficient compared with other BDL methods. We combine the benefits of simple sample-based methods for uncertainty estimation with density and distance-based methods for OoD detection.

    \item We demonstrate the applicability of \texttt{LaREx} beyond image classification with more complex computer vision tasks, namely semantic segmentation and object detection tasks. Moreover, we show that image-level detection still has compelling benefits compared to more fine-grained detection schemes at the pixel or object level.


    \item We performed extensive experimentation comparing the proposed confidence scores with standard baselines and benchmarks. In addition, we performed ablation studies presenting perspectives on enhancing the practical effectiveness of \texttt{LaREx}, encompassing aspects such as regularization, dimensionality reduction, and the DNN layer from which representation samples are collected.
    

\end{enumerate}

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%							SECTION							%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Background}
\label{sec:background}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%						 SUBSECTION							%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\subsection{Problem Formulation}
\label{sec:problem_formulation}
Data distribution shift detection can be framed as a binary classification task. The classifier $\mathbf{\Omega}$ aims at using a confidence score $\mathcal{S}$ with a corresponding threshold $\tau$ to determine (at inference time) whether a new input sample $\vx^{*}$ belongs to the training data distribution or not (OoD, anomalous samples), as presented in \cref{eq:monitor_func}:
\begin{equation}
    \mathbf{\Omega} \Big( \mathcal{S}(\vx^{*}), \tau \Big) =
    \begin{cases}
        1 \;\; (\text{InD}), & \text{if } \mathcal{S}(\vx^{*}) \geq \tau\\
        0 \;\; (\text{OoD}), & \text{if } \mathcal{S}(\vx^{*}) < \tau
    \end{cases}
    \label{eq:monitor_func}
\end{equation}
Therefore, following the equation above, the goal is to derive a confidence score such that, by convention in the literature, InD (positive) samples receive higher confidence scores than OoD or anomalous input samples. The classifier $\mathbf{\Omega}$ then uses the confidence score $\mathcal{S}$ as a measure of trust in the DNN to reach its verdict.
% $\mathbb{P}^{+}$



%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%						 SUBSECTION							%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\subsection{Related Work}
\label{sec:related_work}
In distribution shift detection, post-hoc methods aim to create confidence scores that have a minimal impact on the DNN architecture and the training process without altering the loss function. Post-hoc methods are presented below.

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\textbf{Output-based Methods.} These methods aim at devising confidence scores based on the DNN outputs. \citet{hendrycks2016baseline} proposed the first simple baseline method that uses the maximum softmax probability (MSP) as an InD membership score. Later work suggests that using the maximum logit outperforms MSP \citep{hendrycks2019scaling}. More recently, \citet{liu2020energy} proposed the energy score, computed as the log-sum-exp of the prediction logits over all classes. In this line of work, ASH~\citep{djurisic2022extremely}, DICE~\citep{sun2022dice}, and ReAct~\citep{sun2021react} have worked on improving the 
energy score separability for InD and OoD data by modifying the activations of the penultimate layer and applying thresholding and scaling, sparsification, or clipping. In the context of uncertainty estimation, sample-based approximate Bayesian inference methods~\citep{gal2016dropout, lakshminarayanan2017simple} are used to generate multiple predictions for the same input sample, from which the predictive entropy and mutual information can be used as confidence scores~\citep{kirsch2021pitfalls,mukhoti2023deep}. Unlike these methods, we do not use the DNN outputs to build our confidence scores.


%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\textbf{Density-based Methods.} These methods focus on modeling InD density using probabilistic models. In the context of discriminative models, deterministic uncertainty estimation methods~\citep{postels2020hidden,blum2021fishyscapes,mukhoti2023deep} aim to estimate the embedding density while connecting to the traditional BDL approach.
Another line of work employs generative models to represent the training data distribution, assuming that high-likelihood values correspond to InD samples and low-likelihood values to OoD samples. However, \citet{nalisnick2018deep} showed that this assumption does not hold, since the typical set of the data may not intersect with the high-likelihood region, and adopted a typicality test approach using a batch of samples. \citet{choi2018waic} suggest that OoD data may receive higher likelihoods due to epistemic errors and propose using an ensemble of density models to address this issue. Follow-up work from \citet{morningstar2021density} proposes assessing typicality through multiple summary statistics from the model and their corresponding density estimates to build a score for a single sample.
In contrast, our approach incorporates the ideas from both of the previous lines of work and uses the entropy density from intermediate representations to build our confidence scores.




%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\textbf{Distance-based Methods.} These methods assume that OoD samples lie farther from the training reference examples than InD samples. \citet{lee2018simple} proposed using the minimum Mahalanobis distance to the per-class embedding centroids, assuming that the feature space follows a multivariate normal distribution. Recent work from \citet{sun2022out} shows promising results by following a non-parametric approach in the feature space and using the $k$-th nearest neighbor (KNN) distance. Other works~\citep{techapanurak2020hyperparameter,nitsch2021out} use the cosine similarity between class embeddings and test sample embeddings as a confidence score.
Our proposed scores follow both the parametric and the non-parametric approach for entropy density estimation. The parametric version is used to compute the Mahalanobis distance.


%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\textbf{Detection in complex computer vision tasks.} For object detection, \citet{du2022vos} proposed to modify the training procedure of an RCNN to synthesize virtual outliers in the feature space so that the energy score behaves differently for InD and OoD samples. More recently, \citet{wilson2023safe} proposed training an auxiliary network to distinguish hidden state activations across the backbone of an RCNN for InD and OoD samples by generating outliers as corrupted images in the input space. In semantic segmentation, recent benchmarks~\citep{chan2021segmentmeifyoucan} adapt common post-hoc methods to detect anomalies at the pixel level, despite the high execution runtime that hinders their practical utility. Instead, our approach detects shifts at the image level, as a preliminary and complementary step that signals the presence of potential finer-grained anomalies. This choice is discussed further in \Cref{append:image_lvl_vs_detailed_lvl_detect}.




%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%					      SECTION							%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Method}
\label{sec:Method}

We propose an uncertainty-based confidence score that leverages the entropy from an intermediate DNN latent representation. Taking inspiration from \citet{morningstar2021density}, in our formulation, the DNN latent representation entropy is represented as a random variable $\Psi \sim f_{\Psi}(\psi)$, and we estimate its density by employing the InD training samples. Next, we use the estimated representation entropy density to build a confidence score that enables the detection of newly shifted samples (OoD samples). Below, we describe our approach to capture latent representation entropy, the InD entropy density $f_{\Psi}(\psi)$ estimation, and the confidence scores computation.


%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%					    SUBSECTION							%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\subsection{Latent Representations Uncertainty}
\label{subsec:lat_rep_uncertainty}

Key to our approach is the estimation of uncertainty from a DNN latent representation. A simple way to estimate uncertainty is by applying dropout \citep{srivastava2014dropout} to add multiplicative noise to latent representation $\tilde{\vz}$, as presented in \cref{eq:dropout}:

\begin{equation}
    \vz = \rvm \odot \tilde{\vz}, \qquad \rvm \sim \mathcal{B}(p_{m})
\label{eq:dropout}
\end{equation}
where $\rvm$ is the dropout mask, a vector of independent \textit{Bernoulli} random variables with the same dimension as~$\tilde{\vz}$, and $p_{m}$ is the drop probability. A mask $\vm$ is sampled and multiplied element-wise with the latent code $\tilde{\vz}$ to produce a modified \textit{\say{noisy}} latent code $\vz$, for which we would like to marginalize out the dropout mask noise as follows:
\begin{equation}
    p_{\theta}(\rvz \mid \vx) = \int{p_{\theta}(\rvz \mid \vx, \vm) \underbrace{p(\vm) \;d\vm}_{\textit{dropout masks}}}
\label{eq:z_dropout_expectation}
\end{equation}
Thus, to get the uncertainty of the latent code $\rvz$, we draw multiple dropout masks from $\rvm$ and produce a set of $M$ samples of $\rvz$, $\{\vz_{i}\}_{i=1}^{M}$, that approximates \cref{eq:z_dropout_expectation}. This set of samples, produced with a DNN with weights $\theta$ and input $\vx$, helps us characterize the sampling distribution $p_{\theta}(\rvz \mid \vx)$, whose entropy is presented in~\cref{eq:entropy_z_given_x}.
\begin{equation}
    \mathbb{H}(\rvz \mid \vx) = -\int{p_{\theta}(\vz  \mid \vx)\, \ln \,p_{\theta}(\vz  \mid \vx)}\,d\vz
\label{eq:entropy_z_given_x}
\end{equation}

From a practical point of view, we need a single dropout layer to get the samples $\{\vz_{i}\}_{i=1}^{M}$ that approximate the integral from~\cref{eq:z_dropout_expectation}. In addition, during deployment, this allows us to speed up sample acquisition, since we no longer need to pass an input sample through the whole DNN once per sample. We perform a single forward pass for a given input sample and capture the latent representation just before the target dropout layer. Then, we apply different dropout masks to the captured latent representation.
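
As a sketch of this approximation, assuming the $M$ masks $\vm_{i}$ are drawn independently from $p(\vm)$, the integral in \cref{eq:z_dropout_expectation} can be replaced by the standard Monte-Carlo average
\begin{equation*}
    p_{\theta}(\rvz \mid \vx) \approx \frac{1}{M} \sum_{i=1}^{M} p_{\theta}(\rvz \mid \vx, \vm_{i}), \quad \vm_{i} \sim p(\vm),
\end{equation*}
where each term corresponds to one noisy code $\vz_{i} = \vm_{i} \odot \tilde{\vz}$ computed from the same captured representation $\tilde{\vz}$. Under this reading, the cost per input is one pass up to the target layer plus $M$ element-wise mask applications.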

Our approach to capture the latent representation uncertainty is akin to the Monte-Carlo dropout (MCD)~\citep{gal2016dropout} method for Bayesian approximation. However, our method differs in that dropout is applied only to the captured latent representation, producing multiple noisy versions of that representation. Thus, to distinguish it from MCD, we use the term \textit{z Monte-Carlo dropout (zMCD)} henceforth.
We refer the reader to \Cref{append:MCD_zMCD_relation} for additional insights into the relation between zMCD and MCD from the perspective of approximate Bayesian inference.





%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%					    SUBSECTION							%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\textbf{zMCD on feature maps.} Standard dropout is ineffective when applied to convolutional neural networks (CNNs) since it does not remove semantic and spatial information from CNN feature maps. On the other hand, dropping contiguous regions in 2D feature maps with \textit{DropBlock} can help remove semantic information and force the remaining units to learn features for the assigned task \citep{ghiasi2018dropblock}. This effect is also desirable for capturing uncertainties, as it overcomes the standard dropout limitation. Therefore, we follow the approach from \citet{deepshikha2021monte} and use DropBlock to capture the uncertainty from feature maps.
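
As a rough sketch of how a DropBlock mask is built, following the description in \citet{ghiasi2018dropblock}: block centers are sampled from a Bernoulli distribution with rate $\gamma$, and every unit inside a $b \times b$ block around a sampled center is dropped. For a target drop probability $p_{m}$ on an $H \times W$ feature map, the rate is set approximately to
\begin{equation*}
    \gamma = \frac{p_{m}}{b^{2}} \cdot \frac{H W}{(H - b + 1)(W - b + 1)},
\end{equation*}
where $b$ and $\gamma$ are notation introduced here for illustration. Whether a given implementation applies exactly this correction is an assumption that should be checked against the code.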


\textbf{Feature map processing.} CNN feature maps are of the form $\vz \in \mathbb{R}^{C\times H  \times W}$, where $C$, $H$, and $W$ denote the number of channels, height, and width of the feature map, respectively. We compute the mean of the feature map across the spatial dimensions ($H$ and $W$) so that the latent feature representation is reduced to a vector:
\begin{equation}
    \vz_{\mu_{c}} = \frac{1}{HW}\sum_{h=1}^{H} \sum_{w=1}^{W} \vz(c, h, w),\; \text{where } \vz_{\mu_{c}} \in \mathbb{R}^{C}
\label{eq:z_mean_h_w}
\end{equation}




%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%					    SUBSECTION							%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\subsection{Representation Entropy Density for Detecting Distribution Shifts}
\label{sec:Repre_entropy_density}

\begin{figure*}[t!]
    \centering
    \includegraphics[width=0.93\linewidth]{Figures/larex_method.png} % original 0.80
    \caption{\texttt{LaRED} \& \texttt{LaREM} confidence score overview. The left part of the figure shows the extraction of clean latent feature maps. In the center, multiple DropBlock masks are applied to the extracted feature map to get the uncertainty from the latent representation $p_{\theta}(\rvz \mid \vx)$. The upper part of the figure depicts the score setup computation to get the entropy density estimates $\hat{f}_{\Psi}$. The lower part of the figure shows the score computation during deployment.}
    \label{fig:lared_and_larem_scores}
\end{figure*}

To start the entropy computation for detecting shifted samples, we first assume access to a training dataset $\mathcal{D}_{t}=\{ \vx_{n}, \vy_{n} \}_{n=1}^{N}$ with~$N$ samples. Next,
we generate a set of zMCD samples $\{\vz_{i}\}_{i=1}^{M}$ for each training sample $\vx_{n}$.
The resulting zMCD samples can then be used to approximate the entropy from~\cref{eq:entropy_z_given_x}, using standard entropy estimation methods~\citep{kozachenko1987sample}:
\begin{equation}
    \hat{\mathbb{H}}_{n} \big( \{\vz_{i}\}_{i=1}^{M} \big) \approx \mathbb{H}_{n}(\rvz \mid \vx_{n})
\label{eq:entropy_hat_ind_samples}
\end{equation}
Consequently, we obtain one entropy estimate vector per sample, $\{\psi_{n}\}_{n=1}^{N}$, for the training dataset~$\mathcal{D}_{t}$ (InD):
\begin{equation}
\begin{aligned}
    \psi &= \hat{\mathbb{H}}(\rvz \mid \vx)\\
    \{\psi_{n}\}_{n=1}^{N} &= \big\{ \hat{\mathbb{H}}_{n}(\rvz \mid \vx_{n}) \big\}_{n=1}^{N}, \quad \vx_{n} \in \mathcal{D}_{t}
\end{aligned}
\label{eq:entropy_hat_samples}
\end{equation}
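To make the shape of $\psi_{n}$ concrete, consider a simplified sketch in which each component of the entropy vector is estimated independently from the $M$ zMCD samples of $\vx_{n}$. If, purely for illustration, component $c$ of $\rvz$ is assumed Gaussian with sample variance $\hat{\sigma}_{n,c}^{2}$ (an assumption of this sketch, not the estimator of \citet{kozachenko1987sample}), its differential entropy has the closed form
\begin{equation*}
    \psi_{n,c} = \tfrac{1}{2} \ln \big( 2 \pi e \, \hat{\sigma}_{n,c}^{2} \big), \quad c = 1, \dots, C,
\end{equation*}
so that $\psi_{n} = (\psi_{n,1}, \dots, \psi_{n,C})^{\top} \in \mathbb{R}^{C}$ has one entry per component of the latent code.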


The entropy estimation samples $\{\psi_{n}\}_{n=1}^{N}$ from $\mathcal{D}_{t}$ are used to estimate the InD entropy density function $f_{\Psi} \approx \hat{f}_{\Psi}$. We estimate $f_{\Psi}$ either non-parametrically, using Kernel Density Estimation (KDE), or parametrically, by assuming that $f_{\Psi}$ is a multivariate Normal distribution parameterized by the estimated mean $\hat{\mu}_{\psi}$ and covariance $\hat{\Sigma}_{\psi}$ from $\{\psi_{n}\}_{n=1}^{N}$, as shown in \cref{eq:hat_h_kde_pdf} and \cref{eq:hat_normal_h_pdf}, respectively.
\begin{equation}
    \hat{f}_{\Psi} = \hat{f}_{KDE}\big( \{\psi_{n}\}_{n=1}^{N} \big)
\label{eq:hat_h_kde_pdf}
\end{equation}
\begin{equation}
    \hat{f}_{\Psi} = \mathcal{N}\big( \hat{\mu}_{\psi}, \hat{\Sigma}_{\psi} \big)
\label{eq:hat_normal_h_pdf}
\end{equation}
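For reference, a common form of the two estimators is sketched below. It assumes a kernel $K$ that is a density on $\mathbb{R}^{d}$ (for instance, a standard Gaussian) with bandwidth $h$ for the KDE, and the usual sample moments for the Normal model. The kernel, the bandwidth $h$, and the dimension $d$ of $\psi$ are notation introduced here for illustration:
\begin{equation*}
\begin{aligned}
    \hat{f}_{KDE}(\psi) &= \frac{1}{N h^{d}} \sum_{n=1}^{N} K\Big( \frac{\psi - \psi_{n}}{h} \Big),\\
    \hat{\mu}_{\psi} &= \frac{1}{N} \sum_{n=1}^{N} \psi_{n},\\
    \hat{\Sigma}_{\psi} &= \frac{1}{N-1} \sum_{n=1}^{N} \big( \psi_{n} - \hat{\mu}_{\psi} \big) \big( \psi_{n} - \hat{\mu}_{\psi} \big)^{\top}.
\end{aligned}
\end{equation*}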



At test or deployment time, we use the estimated InD entropy density $\hat{f}_{\Psi}$ to produce a confidence score for a new input sample $\vx^{*}$. To this end, we produce a set of zMCD samples $\{\vz_{i}^{*}\}_{i=1}^{M}$ to estimate the entropy vector of the latent representation $\rvz^{*}$ for the new input sample $\vx^{*}$:
\begin{equation}
    \psi_{\vx^{*}} = \hat{\mathbb{H}}(\rvz^{*} \mid \vx^{*})
\label{eq:entropy_hat_new_sample}
\end{equation}
In the case of \texttt{LaRED}---\textit{Latent Representation Entropy Density log-likelihood}---score, we compute the log-likelihood of the entropy estimation $\psi_{\vx^{*}}$ for a new input sample $\vx^{*}$, using the estimated entropy density function from \cref{eq:hat_h_kde_pdf}:
\begin{equation}
    \texttt{LaRED}(\vx^{*}) = \log \hat{f}_{KDE} \big( \psi_{\vx^{*}} \big)
\label{eq:lared_score}
\end{equation}
\Cref{eq:lared_score} is equivalent to the confidence score from \citet{morningstar2021density}. However, our confidence score uses a single summary statistic, \ie, the latent representation entropy, instead of multiple summary statistics.



For the \texttt{LaREM}---\textit{Latent Representation Entropy density Mahalanobis distance}---score, we compute the negative Mahalanobis distance, using the estimated density $\hat{f}_{\Psi}$ parameters from~\cref{eq:hat_normal_h_pdf} and the entropy estimation $\psi_{\vx^{*}}$ for $\vx^{*}$:
\begin{equation}
    \texttt{LaREM}(\vx^{*}) = -\Big(  \big( \psi_{\vx^{*}} - \hat{\mu}_{\psi} \big)^{\top} \hat{\Sigma}_{\psi}^{-1}  \big( \psi_{\vx^{*}} - \hat{\mu}_{\psi} \big) \Big)
\label{eq:larem_score}
\end{equation}
\Cref{eq:larem_score} is based on the score from \citet{lee2018simple}. However, we do not perform per-class centroid distance computations. Moreover, the \texttt{LaREM} score uses negative distance values to align with the convention where InD samples have higher confidence score values.
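
The two scores are closely related. As a short derivation, under the Normal model of \cref{eq:hat_normal_h_pdf}, the log-density of $\psi_{\vx^{*}}$ is
\begin{equation*}
\begin{aligned}
    \log \mathcal{N}\big( \psi_{\vx^{*}}; \hat{\mu}_{\psi}, \hat{\Sigma}_{\psi} \big) = {} & \tfrac{1}{2}\, \texttt{LaREM}(\vx^{*}) \\
    & - \tfrac{1}{2} \log \det \big( 2 \pi \hat{\Sigma}_{\psi} \big),
\end{aligned}
\end{equation*}
where the last term does not depend on $\vx^{*}$. Hence, \texttt{LaREM} ranks samples exactly as the Gaussian log-likelihood would, and the quadratic form in \cref{eq:larem_score} is the squared Mahalanobis distance.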

\textbf{Entropy vector dimensionality reduction.} Following previous works \citep{lee2018simple,postels2020hidden,yang2023full}, we apply principal component analysis (PCA) to reduce the dimensionality of the obtained entropy vectors $\psi_{\vx^{*}}$. Entropy vectors have the same dimensions as the latent code $\vz$ or $\vz_{\mu_{c}}$. Thus, the goal is to reduce the dimensions from $C$ to $C^{\prime}$ so that $\psi_{\vx^{*}} \in \mathbb{R}^{C^{\prime}}$, where $C^{\prime} < C$. Applying PCA is particularly important for the \texttt{LaRED} score, given the well-known limitations of KDE in high-dimensional spaces.
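
As a sketch of this step, assume that PCA is fit on the InD entropy vectors $\{\psi_{n}\}_{n=1}^{N}$ (the usual practice, although not stated explicitly above). Let $\bar{\psi}$ be their sample mean and let $\mathbf{W} \in \mathbb{R}^{C \times C^{\prime}}$ hold the $C^{\prime}$ leading eigenvectors of their sample covariance; both symbols are introduced here for illustration. The reduced entropy vector is then
\begin{equation*}
    \psi^{\prime} = \mathbf{W}^{\top} \big( \psi - \bar{\psi} \big) \in \mathbb{R}^{C^{\prime}},
\end{equation*}
and the same projection is applied to $\psi_{\vx^{*}}$ at test time before evaluating either score.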

\Cref{fig:lared_and_larem_scores} shows our approach to capturing uncertainty from latent representations (as described in \Cref{subsec:lat_rep_uncertainty}) and presents an overview of both \texttt{LaRED} \& \texttt{LaREM} confidence score setup and computation during deployment. \texttt{LaRED} \& \texttt{LaREM} computation details are available in \Cref{supp-sec:algo}.









%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%							SECTION							%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Experiments \& Results}
\label{sec:experiments_results}

The experimental evaluation in this section aims to answer the following questions: 1) How do \texttt{LaRED} \& \texttt{LaREM} scores perform compared to other post-hoc baseline methods for distribution shift detection? 2) How do different design choices affect \texttt{LaRED} \& \texttt{LaREM} performance? 3) Can \texttt{LaRED} \& \texttt{LaREM} scale to more complex computer vision tasks with different DNN architectures?

\textbf{Evaluation Metrics.} To evaluate the proposed method, we select three metrics commonly used for detecting shifted (OoD and anomalous) samples: 1) \textbf{FPR95}, the false positive rate (FPR) on OoD samples when the true positive rate (TPR) on InD samples is 95\%; 2) \textbf{AUROC}, the area under the receiver operating characteristic curve; and 3) \textbf{AUPR}, the area under the precision-recall curve.
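
Concretely, with InD as the positive class as in \cref{eq:monitor_func}, FPR95 can be written as follows, where $\tau_{95}$ is notation introduced here for the threshold at which 95\% of InD samples are accepted:
\begin{equation*}
\begin{aligned}
    \tau_{95} &: \; P\big( \mathcal{S}(\vx) \geq \tau_{95} \mid \vx \text{ is InD} \big) = 0.95,\\
    \mathrm{FPR95} &= P\big( \mathcal{S}(\vx) \geq \tau_{95} \mid \vx \text{ is OoD} \big).
\end{aligned}
\end{equation*}
Note that $\tau_{95}$ depends only on InD scores, which is consistent with the OoD-agnostic threshold selection described in \Cref{sec:introduction}.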




%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%					    SUBSECTION							%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\subsection{Image Classification}
\label{subsec:simple_class}

\textbf{Experiment setup.} For the classification task, we use a standard ResNet-18 \citep{he2016deep} DNN trained with the CIFAR-10 (InD) dataset. For the OoD detection evaluation, we consider the SVHN \citep{netzer2011reading}, Places365 \citep{zhou2017places}, LSUN-Crop~\citep{yu2015lsun}, LSUN-Resize~\citep{yu2015lsun}, Textures \citep{cimpoi2014describing}, iSUN~\citep{xu2015turkergaze}, and Fashion MNIST~\citep{xiao2017fashion} datasets. We compare \texttt{LaREx} with common post-hoc detection methods from the literature that do not require additional OoD data. Uncertainty-based detection methods, \ie, predictive entropy and predictive mutual information (MI) with MCD, and \texttt{LaREx} with zMCD, use 16 MC samples. Additional experiment details are provided in \Cref{append:img_class_exp_detail}.


\textbf{Results.} \Cref{table:rn18_class_cifar10_larem_lared_results} presents the average detection performance results over all the considered OoD datasets and the baselines described in \Cref{sec:related_work}. The results show that, in general, \texttt{LaREx} performance is on par with post-hoc baseline methods while being faster than other BDL approaches, as discussed in \Cref{subsec:runtime_exec}. In particular, \texttt{LaRED} \& \texttt{LaREM} occupy the second and third positions, respectively, after KNN~\citep{sun2022out}, denoting the benefits of not imposing a distributional assumption for the latent space. For both \texttt{LaRED} \& \texttt{LaREM}, the best results are obtained without applying PCA. Interestingly, methods that aim at improving the Energy score (ReAct~\citep{sun2021react}, DICE~\citep{sun2022dice}, and ASH~\citep{djurisic2022extremely}) perform worse than the vanilla Energy score \citep{liu2020energy}. We believe that this drop in performance may be due to sub-optimal parameter selection, \ie, we used the best parameters proposed by the authors of each baseline without searching for parameters that perform better on this benchmark. Finally, for the other uncertainty-based methods (BDL w/MCD), the performance drop is more noticeable, which agrees with observations from prior work~\citep{kirsch2021pitfalls,mukhoti2023deep}.




\textbf{Where to collect zMCD samples?} To answer this question, we add a noise layer (DropBlock or dropout layer) at different locations of the neural network to enable zMCD sampling. To this end, we consider the output of each residual block of the ResNet-18 as a candidate location to take zMCD samples. We use DropBlock at the outputs of residual blocks 1 to 3 and a dropout layer at the output of residual block 4. The dropout layer is used in the last position, given the dimensions of the embedding representation after the average pooling. \Cref{subfig:LaREx_zMCD_pos_auroc_aupr_fpr} shows the average detection performance when samples are taken using DropBlock or dropout at the DNN locations described above. For both \texttt{LaRED} \& \texttt{LaREM}, the performance peaks when DropBlock is placed at the output of residual block 2 to take zMCD samples.




\textbf{DropBlock size matters.} \Cref{subfig:LaREx_db_size_auroc_aupr_fpr} presents the average detection performance across all OoD datasets for three different DropBlock sizes and a fixed drop probability of 0.5. For both methods, the performance is similar for block sizes of $3\times3$ and $5\times5$, with a subtle difference in FPR95 that favors the $5\times5$ block. A block size of $8\times8$ causes a more noticeable drop in performance for both methods. We attribute this effect to the fact that bigger DropBlock sizes tend to remove more relevant information, which can be vital for our methods.

\begin{table}[t!]
\centering
% \footnotesize
% \scriptsize
\scriptsize
\caption{Image classification average detection performance results across seven OoD datasets. All the detection methods use the same DNN trained with CIFAR-10 (InD) dataset and with all the regularization options from \Cref{fig:DNN_regularization_impact_larex}. The best results are shown in \textbf{bold}, second best are \underline{underlined}.}
\begin{tabular}{@{}llll@{}}
\toprule
\multicolumn{1}{c}{\textbf{Method}} &
  \multicolumn{1}{c}{\textbf{FPR95 $\downarrow$}} &
  \multicolumn{1}{c}{\textbf{AUROC $\uparrow$}} &
  \multicolumn{1}{c}{\textbf{AUPR $\uparrow$}} \\ \midrule
MSP         & 61.80 ± 9.60  & 84.44 ± 3.49  & 85.88 ± 3.80  \\
Pred. Entropy  & 53.76 ± 16.98 & 88.44 ± 4.76  & 89.32 ± 4.96  \\
Pred. MI      & 82.99 ± 9.17  & 77.29 ± 5.81  & 79.19 ± 5.38  \\
Energy              & 47.96 ± 21.56 & 89.23 ± 7.18  & 89.27 ± 8.04  \\
ASH         & 63.96 ± 18.21 & 80.87 ± 10.03 & 79.82 ± 12.52 \\
ReAct                & 93.10 ± 2.58  & 53.76 ± 4.26  & 53.06 ± 4.77  \\
DICE                  & 81.51 ± 16.13 & 64.47 ± 12.10 & 62.19 ± 11.14 \\
DICE+ReAct                            & 92.31 ± 3.27  & 54.71 ± 5.73  & 54.27 ± 6.14  \\
KNN                    & \textbf{32.90 ± 20.30} & \textbf{92.65 ± 6.23}  & \textbf{92.63 ± 6.32}  \\
Mahalanobis         & 57.35 ± 27.00 & 80.00 ± 11.45 & 78.70 ± 11.63 \\
LaRED (ours)                           & \underline{33.16 ± 20.29} & \underline{90.80 ± 6.50} & \underline{90.80 ± 6.50}  \\
LaREM (ours)                           & 37.33 ± 20.03 & 89.20 ± 6.62  & 87.60 ± 7.05  \\ \bottomrule
\end{tabular}
\label{table:rn18_class_cifar10_larem_lared_results}
\end{table}

\begin{figure}[t!]
    \centering
    \begin{subfigure}{\linewidth}
        \includegraphics[width=0.49\linewidth]{Figures/img_class/larex_pos_auroc_aupr.png}
        \includegraphics[width=0.49\linewidth]{Figures/img_class/larex_pos_fpr.png}
        \caption{DNN place for zMCD samples}
        \label{subfig:LaREx_zMCD_pos_auroc_aupr_fpr}
    \end{subfigure}
    \begin{subfigure}{\linewidth}
        \includegraphics[width=0.49\linewidth]{Figures/img_class/larex_dbsize_auroc_aupr.png}
        \includegraphics[width=0.49\linewidth]{Figures/img_class/larex_dbsize_fpr.png}
        \caption{DropBlock size}
        \label{subfig:LaREx_db_size_auroc_aupr_fpr}
    \end{subfigure}
\caption{Impact of LaRED \& LaREM design choices on average detection performance across all OoD datasets.}
\label{fig:LaREx_design_choices}
\end{figure}

\textbf{Regularization improves performance.} We consider the impact of adding an extra dropout (DO) layer and data augmentation (DA) as simple ways to increase DNN regularization. The additional DO regularization layer is placed at the output of the ResNet encoder before the last linear layer. In addition, motivated by prior work on deterministic uncertainty estimation methods \citep{mukhoti2023deep,liu2020simple}, we also consider Spectral Normalization (SN) regularization. However, based on the work from \citet{ghosh2020from}, we use SN only in the layers after the output of the residual block where we placed the DropBlock layer, in order to regularize the latent space from which we take the zMCD samples. \Cref{fig:DNN_regularization_impact_larex} shows the average detection performance results across all the OoD datasets. The figure shows that regularization affects the performance of our method (and of the baselines too). DA alone has a higher positive impact on performance than SN or DO. SN outperforms DO when applied alone. However, when DA is combined with DO, it outperforms DA+SN, validating the importance of noise injection during training. Moreover, applying all of them seems to be the most beneficial option for all OoD detection methods. We used a DNN trained with all the regularization options to compare our approach and the baseline methods in~\Cref{table:rn18_class_cifar10_larem_lared_results}. We refer the reader to \Cref{append:img_class_exp_detail} for further details.





%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%					    SUBSECTION							%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\subsection{Object Detection}
\label{subsec:obj_detect}
\textbf{DNN model.} For the object detection task, we build on the work from \citet{du2022vos} to detect distribution shifts with the Faster-RCNN architecture~\citep{ren2015faster}. The main difference is that VOS \citep{du2022vos} aims to detect shifts at the object level, whereas we aim to detect shifts at the image level. We use the vanilla pre-trained Faster-RCNN model from \citet{du2022vos}, trained on BDD-100K \citep{yu2020bdd100k}. The models are implemented with the Detectron2 library \citep{Detectron2018} with a ResNet-50 \citep{he2016deep} backbone.


\begin{figure}[t!]
    \centering
    \includegraphics[width=0.82\linewidth]{Figures/img_class/regularization_larex.png} %original 0.84
    \caption{DNN regularization impact on LaRED \& LaREM performance. DO: Dropout, SN: Spectral Normalization, DA: Data Augmentation.}
    \label{fig:DNN_regularization_impact_larex}
\end{figure}


\textbf{Experiment setup.} We fine-tuned the pre-trained model into two versions. In the first version, we added a DropBlock layer and fine-tuned both the RPN and the RoI heads. To fine-tune and gather zMCD samples from the RPN, the DropBlock layer had a block size of 4 and a drop probability of 0.5 and was applied after the first convolutional layer. This layer outputs 256 feature maps at each of five resolutions ($12\times21$, $24\times42$, $48\times84$, $96\times168$, and $192\times336$), for a total of 1280 feature maps. These feature maps are reduced by taking the mean of each of them, as presented in~\cref{eq:z_mean_h_w}, to end up with simplified representation and entropy vectors of dimension 1280. In the second version, we added a dropout layer after the penultimate layer of the network (the Box Head, BH) and fine-tuned only the RoI heads (Box Head and Box predictor). To fine-tune and capture zMCD samples from the box head, we used a drop probability of $0.5$. The output of this layer is a tensor of size $1000\times1024$, where the first dimension corresponds to the 1000 boxes ranked highest by the box pooler and the second dimension to the feature vector of each of these boxes. To reduce the dimension of this output, we extracted the mean per box, obtaining an embedding of 1000 components. For the evaluation, we used as OoD data the same splits of the MS-COCO \citep{lin2014microsoft} and OpenImages \citep{kuznetsova2020open} datasets provided by \citet{du2022vos}. Additional experiment details are provided in \Cref{append:obj_detect_exp_detail} in the supplementary material.
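
The dimension bookkeeping of the two versions can be written out explicitly. The following is only a sketch: the indices $l$, $c$, $h$, $w$, $b$, $d$ and the sizes $H_l$, $W_l$, $D$ are notation introduced here for illustration, and we assume that the reduction in~\cref{eq:z_mean_h_w} is a plain spatial average. For the RPN version, each of the 256 feature maps at each of the five resolutions is reduced to one scalar:
\[
\bar{z}^{(l)}_{c} = \frac{1}{H_l W_l} \sum_{h=1}^{H_l} \sum_{w=1}^{W_l} z^{(l)}_{c,h,w},
\qquad l = 1,\dots,5, \quad c = 1,\dots,256,
\]
which gives $5 \times 256 = 1280$ components. For the box head version, each of the 1000 boxes is reduced over its $D = 1024$ features:
\[
\bar{z}_{b} = \frac{1}{D} \sum_{d=1}^{D} z_{b,d}, \qquad b = 1,\dots,1000,
\]
which gives the 1000-component embedding.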



\textbf{Baseline Methods.} We implemented and adapted the following baselines for the object detection case: MSP \citep{hendrycks2016baseline}, mutual information \citep{gal2016uncertainty}, predictive entropy \citep{gal2016uncertainty}, Energy score \citep{liu2020energy}, DICE \citep{sun2022dice}, ReAct \citep{sun2021react}, DICE+ReAct, and ASH \citep{djurisic2022extremely}. For the energy-based methods, we implemented and evaluated two versions of each one: one using the raw (R) output of the network (1000 results per image) and one using the filtered (F) results after non-maximum suppression (NMS) (of variable size, typically about 10--15 results per InD image). For MSP, predictive entropy, and mutual information, we took the output of the network after NMS. For ASH, we used the 80th percentile for pruning; for DICE, we used the 90th percentile for sparsifying; and for ReAct, we also used the 90th percentile for clipping. For SAFE \citep{wilson2023safe} and VOS \citep{du2022vos}, we report the results from their respective papers.
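
For reference, the Energy score and the ReAct clipping mentioned above take the following standard forms for a single logit vector. This is a reminder sketch and not a description of our adaptation: the symbols $f_k$, $K$, $h$, and $c$ are notation introduced here, a temperature of 1 is assumed, and how the scores are aggregated over the raw or filtered detections of an image is not shown.
\[
E(x) = -\log \sum_{k=1}^{K} \exp\big(f_k(x)\big),
\qquad
\bar{h} = \min(h, c),
\]
where $f_k(x)$ is the $k$-th logit, $-E(x)$ is used as the confidence score (higher for InD samples), $h$ is an activation of the penultimate layer, and $c$ is the clipping threshold, here assumed to be the 90th percentile of the activations estimated on InD data.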


\textbf{Results.} \Cref{table:obj_detect_bdd100k_LaRED} summarizes the results for \texttt{LaREx} when using zMCD samples from the object detector RPN and the box head.
When the samples are collected from the RPN, \texttt{LaRED} w/40 PCA components and \texttt{LaREM} w/56 PCA components perform better than the version where samples are taken from the BH. In the latter, \texttt{LaRED} w/2 PCA components gives the best results, presumably owing to the dimensionality of the latent representations. The advantage of the RPN location agrees with our previous results for image classification, where placing the DropBlock layer at a more intermediate location in the DNN leads to better results than placing it closer to the output. In general, both \texttt{LaRED} \& \texttt{LaREM} show a competitive performance compared to the other adapted baselines. In particular, \texttt{LaRED} with RPN samples shows the best AUROC on COCO. Interestingly, our adapted F-ReAct, F-DICE, and F-ASH methods improve on the AUROC of the F-Energy score, and R-DICE shows the best results for OpenImages detection.
Furthermore, despite not being a fair comparison, we report the results from VOS and SAFE, which show that image-level detection performance, in general, surpasses the object-level detection performance of recent works.


\begin{table}[!t]
\centering
\scriptsize
\caption{Object detection OoD detection results. \texttt{LaRED} is applied at two different places of the Faster-RCNN DNN object detector trained with the BDD-100K dataset (InD dataset). The $^{\dag}$ symbol denotes the 2nd DNN version that applies fine-tuning only to the RoI heads. The best results are in \textbf{bold} for each metric, and the second best are \underline{underlined}. The $^{\clubsuit}$ symbol indicates the results as reported in \citet{du2022vos,wilson2023safe}.}
\label{table:obj_detect_bdd100k_LaRED}
\begin{tabular}{@{}lllll@{}}
\toprule
\multicolumn{1}{c}{\multirow{2}{*}{\textbf{Method}}} &
  \multicolumn{2}{c}{\textbf{COCO}} &
  \multicolumn{2}{c}{\textbf{OpenImages}} \\ \cmidrule(l){2-5} 
\multicolumn{1}{c}{} &
  \textbf{FPR95 $\downarrow$} &
  \textbf{AUROC $\uparrow$} &
  \textbf{FPR95 $\downarrow$} &
  \textbf{AUROC $\uparrow$} \\ \midrule
R-Energy                    & 1.27±1.20          & 99.70±1.02                   & 0.22±0.18          & 99.87±0.10          \\
F-Energy                    & 18.08±1.28         & 92.27±1.98                   & 18.17±1.65         & 90.56±1.03          \\
R-ASH                       & 68.88±1.02         & 68.98±1.49                   & 80.40±1.47         & 62.45±0.82          \\
F-ASH                       & 3.35±1.84          & 99.27±0.96                   & 3.12±1.03          & 99.25±0.05          \\
R-ReAct                     & 29.68±0.98         & 94.92±1.74                   & 18.51±1.48         & 96.92±1.47          \\
F-ReAct                     & 2.18±1.52          & 99.56±1.38                   & 5.11±1.66          & 98.90±1.06          \\
R-DICE                      & 32.44±1.48         & 93.80±1.32                   & \textbf{0.01±0.01} & \textbf{99.98±0.02} \\
F-DICE                      & 19.20±1.32         & 94.78±1.59                   & 20.55±1.21         & 94.67±1.58          \\
R-DICE+ReAct                & 24.94±1.74         & 94.36±1.46                   & 97.55±1.16         & 52.14±1.98          \\
F-DICE+ReAct                & 64.99±1.63         & 63.70±0.99                   & 73.63±1.55         & 76.03±1.53          \\ \midrule
MSP                         & \textbf{0.21±0.87} & \underline{99.79±0.68} & 0.11±0.06          & 99.88±0.71          \\
Pred. Entropy               & 68.88±1.02         & 68.98±1.49                   & 80.40±1.47         & 62.45±0.82          \\
Pred. MI                    & 54.46±1.14         & 77.98±1.52                   & 21.35±1.26         & 85.45±0.97          \\ \midrule
LaRED RPN (ours) &
  \underline{0.31±0.30} &
  \textbf{99.81±0.40} &
  0.22±0.21 &
  99.88±0.60 \\
LaREM RPN (ours) &
  0.74±0.42 &
  99.77±0.29 &
  \underline{0.10±0.08} &
  \underline{99.91±0.08} \\
LaRED BH $^{\dag}$ (ours)    & 12.07±0.60         & 97.48±0.80                   & 10.33±1.20         & 97.54±0.90          \\ \midrule
VOS-ResNet50$^{\clubsuit}$  & 44.27±2.0          & 86.87±2.1                    & 35.54±1.7          & 88.52±1.3           \\
SAFE-ResNet50$^{\clubsuit}$ & 32.56±0.8          & 88.96±0.6                    & 16.04±0.5          & 94.64±0.3           \\ \bottomrule
\end{tabular}
\end{table}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%					    SUBSECTION							%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\subsection{Semantic Segmentation}
\label{subsec:sem_seg}

\textbf{DNN models.} For the semantic segmentation task, we concentrate on the application of the proposed method with the DeepLabv3+ \citep{chen2018encoder} and U-Net \citep{ronneberger2015u} architectures.
To apply \texttt{LaREx}, we added a DropBlock layer at the output of both the DeepLabv3+ and U-Net encoders, using a block size of $8\times8$ and a drop probability of 0.5 to take zMCD samples. We train each architecture on the Cityscapes~\citep{cordts2016cityscapes} and Woodscape~\citep{Yogamani_2019_ICCV} datasets, obtaining the DeepLabv3+ models $\mathcal{M}_1$ (Cityscapes) and $\mathcal{M}_2$ (Woodscape), and the U-Net models $\mathcal{M}_3$ (Cityscapes) and $\mathcal{M}_4$ (Woodscape). Additional details are provided in \Cref{append:sem_seg_exp_detail}.



\textbf{Evaluation Datasets.} Motivated by \citet{ahmed2020detecting}, who argue that semantically similar samples are of practical relevance, we consider data with covariate shift for the experiments. When the DNN is trained with Woodscape, we use Cityscapes data for the evaluation and vice versa. Moreover, we consider the Failure Mode Effect Analysis in perception tasks from \citet{ceccarelli2022rgb} and take into account InD data with perturbations and anomalies. Therefore, we include the Cityscapes and Woodscape datasets with synthetic anomalies and the Woodscape-soiling~\citep{Yogamani_2019_ICCV} dataset for the evaluation.

\begin{table}[t!]
\centering
\scriptsize
\caption{Semantic segmentation average distribution shift detection results for the evaluation datasets in DeepLabv3+ ($\mathcal{M}_1$, $\mathcal{M}_2$) and U-Net ($\mathcal{M}_3$, $\mathcal{M}_4$) architectures trained with Cityscapes ($\mathcal{M}_1$, $\mathcal{M}_3$) and Woodscape ($\mathcal{M}_2$, $\mathcal{M}_4$) datasets respectively. The best results are in \textbf{bold} for each metric.}
\label{table:cs_ws_semseg_larem_lared_results}
\begin{tabular}{@{}llccc@{}}
\toprule
\multicolumn{1}{c}{\multirow{2}{*}{\textbf{ID}}} & \multirow{2}{*}{\textbf{Methods}} & \multicolumn{3}{c}{\textbf{Evaluation Datasets Average Performance}}               \\ \cmidrule(l){3-5} 
\multicolumn{1}{c}{}                             &                                   & \textbf{FPR95 $\downarrow$} & \textbf{AUROC $\uparrow$} & \textbf{AUPR $\uparrow$} \\ \midrule
\multirow{4}{*}{$\mathcal{M}_1$} & Mahalanobis & \textbf{1.07 ± 1.85}   & \textbf{99.70 ± 0.46} & \textbf{99.75 ± 0.38} \\
                                 & KNN         & 8.28 ± 13.92  & 98.37 ± 2.38 & 98.53 ± 2.19 \\
                                 & LaREM       & 3.00 ± 5.20   & 99.39 ± 0.98 & 99.42 ± 0.93 \\
                                 & LaRED-58    & 10.87 ± 18.60 & 97.04 ± 3.96 & 97.07 ± 4.16 \\ \midrule
\multirow{4}{*}{$\mathcal{M}_2$} & Mahalanobis & \textbf{1.59 ± 2.34}   & \textbf{99.36 ± 0.48} & \textbf{99.58 ± 0.34} \\
                                 & KNN         & 7.96 ± 9.89   & 97.72 ± 1.36 & 96.09 ± 4.24 \\
                                 & LaREM       & 21.27 ± 13.73 & 93.57 ± 2.64 & 89.17 ± 5.40 \\
                                 & LaRED-50    & 12.60 ± 8.06  & 96.24 ± 2.13 & 95.48 ± 3.01 \\ \midrule
$\mathcal{M}_3$                  & LaRED-50    & 17.79 ± 12.04 & 95.29 ± 3.84 & 95.02 ± 4.86 \\
$\mathcal{M}_4$                  & LaRED-50    & 20.15 ± 14.61 & 95.42 ± 4.09 & 95.96 ± 4.05 \\ \bottomrule
\end{tabular}
\end{table}

\textbf{Baseline Methods.} In this task, we discarded post-hoc methods based on DNN outputs since we do not target pixel-level anomaly detection. Instead, we use the Mahalanobis \citep{lee2018simple} and KNN \citep{sun2022out} distances as baselines against which to compare \texttt{LaRED} \& \texttt{LaREM}. Note that these distance-based methods differ from those that target pixel-level anomaly detection \citep{chan2021segmentmeifyoucan}. Both methods use the representations from the penultimate layer. Since these representations are now 2D feature maps, we take the mean of each feature map as presented in~\cref{eq:z_mean_h_w}. Furthermore, for the Mahalanobis distance, we calculate a single mean of the entropy vector $\psi$ over the training (InD) set instead of a dedicated mean for each class.


\textbf{Results.} \Cref{table:cs_ws_semseg_larem_lared_results} presents the results for the models of both architectures for semantic segmentation. For both DeepLabv3+ models, \texttt{LaREM} obtains its best results w/o PCA, while \texttt{LaRED} obtains its best results w/58 PCA components and w/50 PCA components for the DNN models trained with Cityscapes and Woodscape, respectively. In all the models, \texttt{LaREx} performance is comparable to the other distance-based baselines. The remaining performance gap of \texttt{LaREx} can be attributed to a sub-optimal selection of the parameters (\eg, DropBlock location and size, PCA components) and to the presence of \say{clean} InD images in the evaluation datasets (in particular in the Woodscape-soiling dataset) that can be handled by the inherent robustness of the DNN. In contrast to the image classification case, for semantic segmentation, the Mahalanobis distance has the best performance in both DeepLabv3+ models across the evaluated datasets, outperforming the KNN distance method. Presumably, the drop in KNN performance is due to a sub-optimal selection of the $k$-th nearest neighbor. Note that we used the parameters proposed by the authors of each baseline, as in the image classification experiments. Next, in both U-Net models, \texttt{LaRED} w/50 PCA components shows the best performance results. In general, the obtained results further validate the importance and effectiveness of reducing the feature maps by computing their mean to get a simplified representation, as presented in \cref{eq:z_mean_h_w}.





%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%					    SUBSECTION							%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\subsection{LaREx Runtime Execution}
\label{subsec:runtime_exec}
% \textbf{Runtime.} \Cref{table:inf_time_results} shows the runtime results 
The runtime results are presented in \Cref{table:inf_time_results}
for \texttt{LaRED} \& \texttt{LaREM} and predictive entropy with MCD-BDL in a DeepLabv3+ trained with Cityscapes. Although this cannot be considered a completely fair comparison, given that MCD-BDL provides a pixel-level confidence score with the predicted uncertainty maps, the runtime results highlight the benefits of our approach in uncertainty-based distribution shift detection. For both \texttt{LaRED} \& \texttt{LaREM}, most of the computation time is dedicated to the score computation after the zMCD sampling step. \texttt{LaREx} sampling is faster since there is no need to perform a complete forward pass of the input samples as in the case of MCD-BDL. Nevertheless, most of the \texttt{LaREx} runtime budget can presumably be attributed to data-transfer operations (GPU-RAM-CPU) for the entropy estimation and the score computation.
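
As a rough reading of the mean runtimes in \Cref{table:inf_time_results}, and assuming that the reported score runtimes include the zMCD sampling step (an assumption made here for illustration), the relative figures are:
\[
\frac{416.01}{225.90} \approx 1.84,
\qquad
\frac{416.01}{227.88} \approx 1.83,
\qquad
\frac{25.4}{225.90} \approx 0.11 .
\]
That is, \texttt{LaRED} and \texttt{LaREM} are roughly $1.8\times$ faster than predictive entropy with 16 MCD samples, and sampling accounts for about 11\% of the \texttt{LaRED} runtime under this assumption. These ratios ignore the reported standard deviations.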


\begin{table}[t!]
\centering
\scriptsize
\caption{DeepLabv3+ uncertainty-based confidence scores runtime comparison on a laptop PC with an Intel i7-9750H CPU and an NVIDIA RTX 2080 GPU. The best results are in \textbf{bold}.}
\label{table:inf_time_results}
\begin{tabular}{@{}lll@{}}
\toprule
\textbf{Method} & \textbf{Description}                & \textbf{Runtime (ms) $\downarrow$} \\ \midrule
Pred. Entropy   & w/16 MCD samples           & 416.01 ± 12.16     \\
LaREx           & sampling only w/16 zMCD samples     & 25.4 ± 3.22       \\
LaRED           & score w/16 zMCD samples & \textbf{225.90 ± 7.00}     \\ 
LaREM           & score w/16 zMCD samples & 227.88 ± 8.91     \\ \bottomrule
\end{tabular}
\end{table}

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%					    SUBSECTION							%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\subsection{Additional Insights \& Discussion}

\textbf{We found that LaREx can work without fine-tuning.} It is possible to simply add a dropout or DropBlock layer \say{by hand} to take zMCD samples and get the method working. However, fine-tuning improves the results in most cases (see \Cref{append:obj_detect_exp_detail}). This observation points to an interesting direction for future work: connecting \texttt{LaREx} theoretically with DICE and ASH.

\textbf{On LaREx limitations.} The most noticeable limitation is the need to perform sampling. However, the method does not require complete forward passes through the network. As mentioned in \Cref{subsec:lat_rep_uncertainty}, during deployment, we can speed up sampling by placing a hook at a desired DNN location to extract the feature representation. Then, with the added noise layer (DropBlock or dropout), we generate multiple noisy samples of the extracted representation.
Another constraint is the absence of a predefined location and size for DropBlock or dropout that is optimal for any architecture when taking the zMCD samples. Therefore, experimental iterations are required to find the best location and parameters for a specific DNN. However, empirically, we have found that DropBlock sizes of $\sim$20--40\% of the original feature map size and drop probabilities of $\sim$0.5 are useful for capturing the desired variability in zMCD samples and lead to good results in multiple architectures.
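
As an illustration of this rule of thumb, consider a hypothetical feature map of size $16\times16$ (a size chosen here only as an example), and assume that the percentage refers to the side length of the feature map rather than to its area. The suggested range is then
\[
0.2 \times 16 = 3.2 \;\le\; \text{block size} \;\le\; 0.4 \times 16 = 6.4,
\]
\ie, integer block sizes of 4, 5, or 6. This is only a starting point for the search, not a guarantee of good performance for a given DNN.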

\textbf{There is \textit{no free lunch} in post-hoc methods.} Throughout our experiments, it became evident that each post-hoc method was influenced by stronger regularization and, notably, by data augmentation. Moreover, identifying a single post-hoc method that universally outperforms the others across diverse computer vision tasks proved challenging, since the performance differences among the best detection methods are subtle. Consequently, an interesting line for future work is the exploration of strategies to combine different confidence scores rather than relying on a single method for all tasks and their corresponding architectures.




%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%							SECTION							%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Conclusion}

We presented two uncertainty-based confidence scores, \texttt{LaREM} \& \texttt{LaRED}, to detect data distribution shifts at the image level. The applicability of our confidence scores was demonstrated beyond simple classification, covering also the semantic segmentation and object detection tasks and the corresponding DNN architectures. Moreover, our confidence scores achieve detection performance comparable to \textit{SotA} methods while running faster than the traditional MCD method from the BDL framework, making them an appealing uncertainty-based alternative. Finally, we provided additional insights into our method and extended the discussion by identifying and proposing different lines for future work.




%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%							SECTION							%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section*{Acknowledgments}
This work has been supported by the French government under the ``France 2030'' program, as part of the SystemX Technological Research Institute within the \textit{confiance.ai} Program (\url{www.confiance.ai}).

This publication was made possible by the use of the FactoryIA supercomputer, financially supported by the Ile-de-France Regional Council.






%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%							REFERENCES					    %
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% References
\bibliography{my_bibliography}





%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%							APPENDIX					    %
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\renewcommand{\theHsection}{A.\arabic{section}}
\newpage

\onecolumn

\title{Latent Representation Entropy Density for Distribution Shift Detection\\(Supplementary Material)}
\maketitle

% This Supplementary Material should be submitted together with the main paper.



\appendix


%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%							SECTION							%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{\texttt{LaREM \& LaRED} Algorithm}
\label{supp-sec:algo}
The computation details for \texttt{LaRED \& LaREM} are available in Algorithm~\ref{algorithm:Algo_LaRED_LaREM}.


\begin{algorithm}[h!]
\footnotesize
% \scriptsize
\caption{Latent Representation Entropy Density-based Distribution Shift Detection: \texttt{LaRED} \& \texttt{LaREM} Confidence Scores.}
\label{algorithm:Algo_LaRED_LaREM}
    \begin{algorithmic}
        \State{\textbf{Definitions:}}
            \Indent
            \begin{itemize}
                \item Trained DNN $p_{\theta}(\rvy \mid \vx)$ with a noise layer (Dropout or DropBlock)
                \item Feature extractor $p_{\theta}(\rvz \mid \vx)$ (Hook on Dropout or DropBlock layer) 
                \item Training dataset samples $\mathcal{D}_{t}=\{ \vx_{n}, \vy_{n} \}_{n=1}^{N}$
            \end{itemize}
            \EndIndent
        % \State
        \State{\textbf{procedure:} \texttt{setup\_LaREx\_score}:}
        \Indent
            
            % \State{Initialize:} $\Psi$ for dataset entropy samples        
            \For{each $\vx_{n} \in \mathcal{D}_{t}$}
            
                \State get $M$ zMCD samples $\{\vz_{i}\}_{i=1}^{M} \sim p_{\theta}(\rvz \mid \vx_{n})$

                \State $\psi_{n}$ $\leftarrow$ entropy $\big(\{\vz_{i}\}_{i=1}^{M} \big)$

                \State save $\psi_{n}$ sample into $\Psi$

            \EndFor

            \State
            \State $\Psi = \{\psi_{n}\}_{n=1}^{N}$
            \State

            \If{\texttt{LaRED}}
                % \State $\hat{f}_{\Psi} = \hat{f}_{KDE} \big( \{\psi_{n}\}_{n}^{N} \big)$
                \State $\hat{f}_{\Psi} = \hat{f}_{KDE} \big( \Psi \big)$
            \EndIf

            \If{\texttt{LaREM}}
                % \State $\hat{\mu}_{\psi} \leftarrow \textit{mean} \big( \{\psi_{n}\}_{n}^{N}  \big)$; $\;\;\;\hat{\Sigma}_{\psi} \leftarrow \textit{covariance} \big( \{\psi_{n}\}_{n}^{N}  \big)$
                \State $\hat{\mu}_{\Psi} \leftarrow \textit{mean} \big( \Psi \big)$; $\;\;\;\hat{\Sigma}_{\Psi} \leftarrow \textit{covariance} \big( \Psi \big)$
                \State $\hat{f}_{\Psi} = \mathcal{N}\big( \hat{\mu}_{\Psi}, \hat{\Sigma}_{\Psi} \big)$
            \EndIf
            
        \EndIndent

        \State{\textbf{end procedure}}

        % \State \Comment{this is a comment}
        \State
        
        \State{\textbf{function:} \texttt{get\_LaREx\_score}$(\textit{new sample}\;\vx^{*})$:}
        \Indent
            \State get $M$ zMCD samples $\{\vz_{i}\}_{i=1}^{M} \sim p_{\theta}(\rvz \mid \vx^{*})$

            \State $\psi_{\vx^{*}}$ $\leftarrow$ entropy $\big(\{\vz_{i}\}_{i=1}^{M} \big)$

            \If{\texttt{LaRED}}
                \State $\mathcal{S} = \log \hat{f}_{KDE} \big( \psi_{\vx^{*}} \big)$
            \EndIf

            \If{\texttt{LaREM}}
                \State $\mathcal{S} = -\Big(  \big( \psi_{\vx^{*}} - \hat{\mu}_{\Psi} \big)^{\top} \hat{\Sigma}_{\Psi}^{-1}  \big( \psi_{\vx^{*}} - \hat{\mu}_{\Psi} \big) \Big)$
            \EndIf

            \State{\textbf{Return} $\mathcal{S}$}

        \EndIndent
        \State{\textbf{end function}}
    \end{algorithmic}
 \end{algorithm}
Entropy estimation was implemented using the \textit{Entropy-Estimators} library\footnote{\url{https://github.com/paulbrodersen/entropy_estimators}}.
For the KDE, in all the experiments, we used a Gaussian kernel with $\text{bandwidth}=1$, as implemented in the Scikit-Learn library\footnote{\url{https://scikit-learn.org/stable/about.html}}.
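
To make the two density models of Algorithm~\ref{algorithm:Algo_LaRED_LaREM} concrete, the following sketch writes them out under assumed notation: $d$ (the dimension of the entropy vectors, after PCA when it is applied) and $h$ (the kernel bandwidth) are symbols introduced here for illustration. With a Gaussian kernel, the standard form of the KDE fitted on $\Psi$ is
\[
\hat{f}_{\Psi}(\psi) = \frac{1}{N} \sum_{n=1}^{N} \frac{1}{(2\pi h^{2})^{d/2}} \exp\Big( -\frac{\lVert \psi - \psi_{n} \rVert_{2}^{2}}{2h^{2}} \Big), \qquad h = 1 .
\]
For \texttt{LaREM}, the score $\mathcal{S}$ is an increasing affine function of the Gaussian log-density fitted in the setup procedure:
\[
\log \mathcal{N}\big( \psi_{\vx^{*}}; \hat{\mu}_{\Psi}, \hat{\Sigma}_{\Psi} \big)
= \tfrac{1}{2}\, \mathcal{S} - \tfrac{1}{2} \log \det \big( 2\pi \hat{\Sigma}_{\Psi} \big),
\]
so ranking samples by $\mathcal{S}$ is equivalent to ranking them by their log-density under $\mathcal{N}\big( \hat{\mu}_{\Psi}, \hat{\Sigma}_{\Psi} \big)$, since the second term does not depend on $\vx^{*}$.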


%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%							SECTION							%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{zMCD Relation with MCD for Approximate Bayesian Inference}
\label{append:MCD_zMCD_relation}

To capture uncertainty, we presented the \textit{z Monte-Carlo Dropout} (zMCD) to produce noisy versions of a given layer's latent representation. Although zMCD is similar to Monte-Carlo Dropout (MCD) for approximate Bayesian inference in DNNs, it cannot be considered part of the Bayesian deep learning family. First, we consider the Bayesian Neural Network (BNN) and its predictions described in \Cref{eq:BayesPostPredDist}.
\begin{equation}
p(\mathbf{y^{*}} \mid \mathbf{x^{*}},\mathcal{D} ) = \int p(\mathbf{y^{*}} \mid \mathbf{x^{*}},\vtheta) \; p(\vtheta \mid \mathcal{D}) \; d\vtheta
\label{eq:BayesPostPredDist}
\end{equation}
To approximate \Cref{eq:BayesPostPredDist}, MCD performs a variational inference approximation to the intractable posterior of the weights $p(\vtheta \mid \mathcal{D})$. In this case, dropout is still applied to the representation from a given layer, and it does not cancel the neural network weights by default.

Following the work from \citet{gal2016dropout}, the cancellation of neural network weights needed to perform approximate Bayesian inference is achieved by rearranging the following linear combination of the latent representation $\vz$, the dropout or DropBlock mask $\vm$, and the layer weights $\vtheta$, as shown below:
\begin{equation}
    \begin{split}
        \hat{\vy} &= \sigma \Big( \big(\vz \odot \; \vm \big) \, \vtheta + \vb \Big)\\
        \hat{\vy} &= \sigma \Big( \vz \; \big(\text{diag}(\vm) \cdot \vtheta \big) + \vb \Big)
    \end{split}
    \label{eq:mcd_linear_comb}
\end{equation}

In zMCD, we capture the latent (noisy) feature representation samples directly at the output of the DropBlock or dropout layers. To cast our method in the Bayesian framework (MCD), we would simply need to collect the activation samples at least after the next layer, so as to respect the weight cancellation presented in \Cref{eq:mcd_linear_comb}.
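
The equality of the two lines in \Cref{eq:mcd_linear_comb} can be checked component-wise. In the sketch below, $\vz$ is treated as a row vector with $D$ entries and $\vtheta$ as a $D \times K$ matrix; the sizes $D$ and $K$ and the indices $d$ and $k$ are notation assumed here for illustration:
\[
\big( (\vz \odot \vm)\, \vtheta \big)_{k} = \sum_{d=1}^{D} z_{d}\, m_{d}\, \theta_{dk} = \big( \vz\, (\text{diag}(\vm) \cdot \vtheta) \big)_{k}, \qquad k = 1,\dots,K .
\]
For example, with $D = K = 2$ and $\vm = (1, 0)$,
\[
\text{diag}(\vm) \cdot \vtheta =
\begin{pmatrix} 1 & 0 \\ 0 & 0 \end{pmatrix}
\begin{pmatrix} \theta_{11} & \theta_{12} \\ \theta_{21} & \theta_{22} \end{pmatrix}
=
\begin{pmatrix} \theta_{11} & \theta_{12} \\ 0 & 0 \end{pmatrix},
\]
so masking the second unit of $\vz$ is equivalent to zeroing the second row of the weights of the next layer.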



%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%							SECTION							%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Image Classification Experiments}
\label{append:img_class_exp_detail}




%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%						SUB-SECTION							%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\subsection{DNN Training Details}

Architecture and training details can be found in \Cref{table:rn18_class_cifar10_larem_lared_training_details}. We used the PyTorch Lightning library for training and inference. The models with the lowest validation loss were saved and used for subsequent inference and OoD detection evaluation. Additionally, when we indicate that a model has spectral normalization (SN), SN is applied to the layers after the position of the DropBlock layer. For example, for the models with a DropBlock layer after the second residual block, SN was applied in the 3rd and 4th residual blocks and the fully connected layer.
Data augmentation methods used during training can be found in \Cref{table:rn18_class_cifar10_larem_lared_data_augmentations}.
The seed for all random generators was~9290.
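
For reference, the standard forms of the focal loss and of the cosine annealing schedule are sketched below. This is not a description of the exact configuration: the focusing parameter $\gamma$, the initial learning rate $\eta_{\max}$, and the schedule length $T$ are symbols introduced here for illustration only and are not reported in this section; $p_t$ denotes the predicted probability of the true class and $t$ the current epoch:
\begin{equation*}
    \mathrm{FL}(p_t) = -(1 - p_t)^{\gamma} \log(p_t), \qquad
    \eta_t = \eta_{\min} + \frac{1}{2} \, (\eta_{\max} - \eta_{\min}) \Big( 1 + \cos\big( \tfrac{\pi t}{T} \big) \Big).
\end{equation*}
With $\gamma = 0$, the focal loss reduces to the cross-entropy loss, and the learning rate decays from $\eta_{\max}$ at $t = 0$ to $\eta_{\min}$ at $t = T$.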

% \vspace{2em}

\begin{table}[h!]
\footnotesize
\centering
\caption{Image classification DNN training details}
\label{table:rn18_class_cifar10_larem_lared_training_details}
\begin{tabular}{@{}ll@{}}
\toprule
Architecture              & ResNet-18               \\
Epochs                    & 300                     \\
Batch size                & 64                      \\
Image size                & $128\times128$          \\
Loss                      & Focal                   \\
Optimizer                 & Adam                    \\
Optim. weight decay       & $1\times10^{-4}$        \\
LR scheduler              & Cosine annealing        \\
LR scheduler $\eta_{\min}$ & $1\times10^{-5}$        \\ \bottomrule
\end{tabular}
\end{table}
\begin{table}[t!]
\centering
\footnotesize
\caption{Image classification DNN training: Data augmentation details}
\label{table:rn18_class_cifar10_larem_lared_data_augmentations}
\begin{tabular}{@{}ll@{}}
\toprule
Augmentation                              & Parameters                   \\ \midrule
Random crop                               & padding: img size / 8        \\
Random color jitter:                      & $p=0.2$                      \\
\tabitem contrast                                  & 10\%                         \\
\tabitem brightness                                & 10\%                         \\
\tabitem saturation                                & 10\%                         \\
Random grayscale                  & $p=0.1$                      \\
Random vertical flip             & $p=0.3$                      \\
Random affine:              & $p=0.2$               \\
\tabitem angle                       & $20^{\circ}$          \\
\tabitem translation                 & 20\%                  \\
\tabitem scale                       & 1\% to 20\%          \\ \bottomrule
\end{tabular}
\end{table}

% TABLE
\begin{table}[t!]
\centering
\scriptsize
\caption{All models trained and tested for image classification with CIFAR10 as InD}
\label{table:rn18_class_cifar10_larex_all_models}
\begin{tabular}{@{}llrrlrllr@{}}
\toprule
\multicolumn{1}{c}{\multirow{2}{*}{ID}} & \multicolumn{3}{c}{DropBlock}                               & \multicolumn{2}{c}{Dropout}        & \multicolumn{1}{c}{\multirow{2}{*}{\begin{tabular}[c]{@{}c@{}}SN\end{tabular}}} & \multicolumn{1}{c}{\multirow{2}{*}{\begin{tabular}[c]{@{}c@{}}DA\end{tabular}}} & \multicolumn{1}{l}{\multirow{2}{*}{\begin{tabular}[c]{@{}l@{}}Val.\\Acc.\end{tabular}}} \\ \cmidrule(lr){2-6}
\multicolumn{1}{c}{}                       & Loc. & \multicolumn{1}{l}{Size} & \multicolumn{1}{l}{Prob.} & Active & \multicolumn{1}{l}{Prob.} & \multicolumn{1}{c}{}                                                                       & \multicolumn{1}{c}{}                                                                   & \multicolumn{1}{l}{}                                                                     \\ \midrule
$\mathcal{M}_{\text{0}}$                   & RSB1 & 10                       & 0.4                       & No     & 0                         & No                                                                                         & Yes                                                                                    & 89.4                                                                                     \\
$\mathcal{M}_{\text{1(-1-4)}}$             & RSB2 & 5                        & 0.4                       & No     & 0                         & No                                                                                         & Yes                                                                                    & 89.2                                                                                     \\
$\mathcal{M}_{\text{2}}$                   & RSB3 & 3                        & 0.4                       & No     & 0                         & No                                                                                         & Yes                                                                                    & 88.2                                                                                     \\
$\mathcal{M}_{\text{3}}$                   & No   & 0                        & 0                         & Yes    & 0.3                       & No                                                                                         & Yes                                                                                    & 88.7                                                                                     \\
$\mathcal{M}_{\text{1-2}}$                 & RSB2 & 8                        & 0.4                       & No     & 0                         & No                                                                                         & Yes                                                                                    & 88.8                                                                                     \\
$\mathcal{M}_{\text{1-3}}$                 & RSB2 & 3                        & 0.4                       & No     & 0                         & No                                                                                         & Yes                                                                                    & 89.2                                                                                     \\
$\mathcal{M}_{\text{1-1-1}}$               & RSB2 & 5                        & 0.4                       & No     & 0                         & No                                                                                         & No                                                                                     & 84.2                                                                                     \\
$\mathcal{M}_{\text{1-1-2}}$               & RSB2 & 5                        & 0.4                       & Yes    & 0.3                       & No                                                                                         & No                                                                                     & 84                                                                                       \\
$\mathcal{M}_{\text{1-1-3}}$               & RSB2 & 5                        & 0.4                       & No     & 0                         & Yes                                                                                        & No                                                                                     & 86.3                                                                                     \\
$\mathcal{M}_{\text{1-1-4}}$               & RSB2 & 5                        & 0.4                       & No     & 0                         & No                                                                                         & Yes                                                                                    & 89.2                                                                                     \\
$\mathcal{M}_{\text{1-1-5}}$               & RSB2 & 5                        & 0.4                       & Yes    & 0.3                       & Yes                                                                                        & No                                                                                     & 86.9                                                                                     \\
$\mathcal{M}_{\text{1-1-6}}$               & RSB2 & 5                        & 0.4                       & Yes    & 0.3                       & No                                                                                         & Yes                                                                                    & 88.3                                                                                     \\
$\mathcal{M}_{\text{1-1-7}}$               & RSB2 & 5                        & 0.4                       & No     & 0                         & Yes                                                                                        & Yes                                                                                    & 90                                                                                       \\
$\mathcal{M}_{\text{1-1-8}}$               & RSB2 & 5                        & 0.4                       & Yes    & 0.3                       & Yes                                                                                        & Yes                                                                                    & 89.7    \\ \bottomrule                                                                                
\end{tabular}
\end{table}


\Cref{table:rn18_class_cifar10_larex_all_models} lists all the models we trained and tested for the OoD detection task, together with their corresponding validation set accuracy. We tested different DropBlock and Dropout layer locations. Models $\mathcal{M}_{\text{0}}$ to $\mathcal{M}_{\text{3}}$ vary the location after every residual block (RSB). After noting that the best location was after the second residual block ($\mathcal{M}_{\text{1}}$), we tested different DropBlock sizes: 3, 5, and 8. These sizes were chosen according to the size of the feature maps, which for the second layer were of size $16\times16$. The best results were obtained with a DropBlock size of 5. We then fixed the DropBlock size and location and tested combinations of three different regularization techniques: Dropout, spectral normalization (SN), and data augmentation (DA), as described in \Cref{subsec:simple_class}. Note that, in \Cref{table:rn18_class_cifar10_larex_all_models}, models~$\mathcal{M}_{\text{1}}$ and~$\mathcal{M}_{\text{1-1-4}}$ are exactly the same model, since they share the same parameters. Moreover, the nomenclature used for models~$\mathcal{M}_{\text{1-1-1}}$ to~$\mathcal{M}_{\text{1-1-8}}$ corresponds, in the same order, to the results presented in \Cref{fig:DNN_regularization_impact_larex}. For example,~$\mathcal{M}_{\text{1-1-1}}$ corresponds to \say{No reg.} and~$\mathcal{M}_{\text{1-1-8}}$ corresponds to \say{DA+DO+SN}.
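
As a rough illustration of how the tested sizes relate to the feature maps, the following arithmetic assumes a single square block that lies entirely inside a $16\times16$ feature map (an assumption made here for illustration only). The fraction of spatial positions covered by one block of size $b$ is then $b^2 / 16^2$:
\begin{equation*}
    \frac{3^2}{16^2} = \frac{9}{256} \approx 3.5\%, \qquad
    \frac{5^2}{16^2} = \frac{25}{256} \approx 9.8\%, \qquad
    \frac{8^2}{16^2} = \frac{64}{256} = 25\%.
\end{equation*}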

% In addition, to evaluate all methods and obtain the metrics, 

To evaluate all the methods, we took samples from the datasets. For CIFAR10 (InD), we took a random sample of 8400 images from the training set. These InD samples were used to build the entropy density estimation for both the \texttt{LaRED} and \texttt{LaREM} scores. The same set of image samples was used to estimate the thresholds for DICE and ReAct and, more generally, to fit the InD score estimators of all the baselines.
The evaluation was then performed on the test sets of CIFAR10 (InD) and of all OoD datasets. In this case, we randomly sampled 5000 images from each OoD dataset to evaluate all baselines and \texttt{LaREx}, with the intention of building balanced-sized datasets. Note that the Textures dataset already has a size of 5000 images. In~\Cref{table:rn18_class_cifar10_full_datasets_vs_subsamples}, we compare the performance obtained with all the samples from the datasets against the performance obtained with the samples mentioned above. The differences are not substantial: for example, the average \texttt{LaRED} AUROC is 91.02 on the full datasets and 90.80 on the samples. Therefore, we performed the evaluation on the sampled sets.
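
For reference, the metrics reported in the following tables can be written as below. This is a sketch in illustrative notation that is not used elsewhere in the paper: $s(\cdot)$ denotes a generic detection score, InD is taken as the positive class, and higher scores are assumed to indicate InD samples (for scores with the opposite convention, the inequalities are reversed). Ties between scores are ignored:
\begin{equation*}
    \begin{split}
        \tau_{95} &: \; \Pr\big( s(x) \geq \tau_{95} \mid x \sim \text{InD} \big) = 0.95, \\
        \text{FPR95} &= \Pr\big( s(x) \geq \tau_{95} \mid x \sim \text{OoD} \big), \\
        \text{AUROC} &= \Pr\big( s(x_{\text{in}}) > s(x_{\text{out}}) \big), \quad x_{\text{in}} \sim \text{InD}, \; x_{\text{out}} \sim \text{OoD}.
    \end{split}
\end{equation*}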


%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%						SUB-SECTION							%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\subsection{Detailed OoD Detection Results}

\Cref{table:rn18_class_cifar10_larex_all_models_results} expands \Cref{table:rn18_class_cifar10_larex_all_models} and shows that the model with all three additional regularization techniques obtains the best overall results. Furthermore, for model~$\mathcal{M}_{\text{1-1-8}}$ (DA+DO+SN), we present the detailed results for all baselines and all datasets in \Cref{table:rn18_class_cifar10_larex_all_methods_results1} and \Cref{table:rn18_class_cifar10_larex_all_methods_results2}. Both tables show that kNN achieves the best overall performance and that \texttt{LaRED} comes in second place across all methods, which supports the effectiveness of the non-parametric assumption. Note that \Cref{table:rn18_class_cifar10_larem_lared_results} is built from the results in \Cref{table:rn18_class_cifar10_larex_all_methods_results1,table:rn18_class_cifar10_larex_all_methods_results2}. To better appreciate the performance of all OoD detection methods, we present the ROC curves of all baselines and \texttt{LaREx} for each OoD dataset in \Cref{fig:roc_curves_per_dataset}. Finally, \Cref{fig:lared_density_scores_all_ood_datasets} shows the density scores of \texttt{LaRED} across all OoD datasets. These plots illustrate the separation that \texttt{LaRED} achieves for each OoD dataset.
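
As a consistency check of how the averaged results relate to the per-dataset results, and assuming that each average is the arithmetic mean over the seven OoD datasets, the \texttt{LaRED} AUROC values of model~$\mathcal{M}_{\text{1-1-8}}$ in \Cref{table:rn18_class_cifar10_larex_all_methods_results1,table:rn18_class_cifar10_larex_all_methods_results2} give
\begin{equation*}
    \frac{79.29 + 95.53 + 94.70 + 98.63 + 91.93 + 86.57 + 89.00}{7} = \frac{635.65}{7} \approx 90.81,
\end{equation*}
which agrees with the value of 90.80 reported in \Cref{table:rn18_class_cifar10_larex_all_models_results} up to rounding of the per-dataset values.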

Regarding the baseline methods, we attribute the comparatively low performance of ASH \citep{djurisic2022extremely}, ReAct \citep{sun2021react}, and DICE+ReAct \citep{sun2022dice} to a choice of parameters that is sub-optimal for our setting. Note, however, that we employed the same values reported as optimal in the corresponding papers, \ie, the 80th percentile for pruning in ASH, the 90th percentile for sparsifying in DICE, and the 90th percentile for clipping in ReAct.

% \vspace{2em}

% TABLE
\begin{table*}[t!]
\centering
\scriptsize
\caption{Image classification average OoD detection performance using the sampled datasets vs using the full datasets}
\label{table:rn18_class_cifar10_full_datasets_vs_subsamples}
\begin{tabular}{@{}cccc|ccc@{}}
\toprule
\multirow{2}{*}{\begin{tabular}[c]{@{}c@{}}Datasets\\ size\end{tabular}} & \multicolumn{3}{c|}{LaRED}                                                                                                                                        & \multicolumn{3}{c}{LaREM}                                                                                                                                         \\ \cmidrule(l){2-7} 
                                                                         & \begin{tabular}[c]{@{}c@{}}\textbf{AUPR} $\uparrow$\end{tabular} & \begin{tabular}[c]{@{}c@{}}\textbf{AUROC} $\uparrow$\end{tabular} & \begin{tabular}[c]{@{}c@{}}\textbf{FPR95} $\downarrow$\end{tabular} & \begin{tabular}[c]{@{}c@{}}\textbf{AUPR}$\uparrow$\end{tabular} & \begin{tabular}[c]{@{}c@{}}\textbf{AUROC}$\uparrow$\end{tabular} & \begin{tabular}[c]{@{}c@{}}\textbf{FPR95}$\downarrow$\end{tabular} \\ \midrule
\begin{tabular}[c]{@{}c@{}}Full\\ datasets\end{tabular}                  & 90.29 ± 6.76                                         & 91.02 ± 6.46                                          & 32.23 ± 19.95                                      & 88.22 ± 7.12                                         & 89.31 ± 6.59                                          & 36.43 ± 19.98                                      \\ \midrule
Samples                                                               & 89.70 ± 6.72                                         & 90.80 ± 6.50                                          & 33.16 ± 20.29                                      & 87.60 ± 7.05                                         & 89.20 ± 6.62                                          & 37.33 ± 20.03                                      \\ \bottomrule
\end{tabular}
\end{table*}


% TABLE
\begin{table*}[t!]
\centering
\scriptsize
\caption{LaREx results for all image classification models trained with CIFAR10 (InD) and all OoD datasets}
\label{table:rn18_class_cifar10_larex_all_models_results}
\begin{tabular}{@{}lllllll@{}}
\toprule
\multicolumn{1}{c}{\multirow{2}{*}{\textbf{Model}}} & \multicolumn{3}{c}{\textbf{LaRED}}                     & \multicolumn{3}{c}{\textbf{LaREM}}                     \\ \cmidrule(l){2-7} 
\multicolumn{1}{c}{}                       & \textbf{FPR95 $\downarrow$}          & \textbf{AUROC $\uparrow$}        & \textbf{AUPR $\uparrow$}         & \textbf{FPR95 $\downarrow$}          & \textbf{AUROC $\uparrow$}        & \textbf{AUPR $\uparrow$}         \\ \midrule
$\mathcal{M}_{\text{0}}$                                         & 50.80 ± 22.27 & 84.01 ± 8.45  & 82.40 ± 9.09  & 52.73 ± 21.95 & 84.45 ± 7.79  & 83.28 ± 7.99  \\
$\mathcal{M}_{\text{1(-1-4)}}$                                  & 36.89 ± 22.84 & 89.56 ± 7.13  & 88.55 ± 7.36  & 38.91 ± 20.09 & 88.41 ± 6.55  & 86.79 ± 6.93  \\
$\mathcal{M}_{\text{2}}$                                         & 55.25 ± 26.88 & 79.68 ± 14.18 & 78.30 ± 13.09 & 50.69 ± 24.46 & 84.03 ± 9.88  & 82.51 ± 9.82  \\
$\mathcal{M}_{\text{3}}$                                         & 89.03 ± 12.14 & 62.60 ± 6.60  & 60.86 ± 3.97  & 78.89 ± 6.50  & 68.75 ± 8.65  & 65.71 ± 9.51  \\
$\mathcal{M}_{\text{1-2}}$                                       & 54.89 ± 20.70 & 81.56 ± 10.97 & 79.26 ± 12.26 & 58.22 ± 21.21 & 80.17 ± 11.43 & 77.82 ± 12.74 \\
$\mathcal{M}_{\text{1-3}}$                                       & 38.51 ± 19.22 & 89.67 ± 5.93  & 88.98 ± 6.00  & 40.19 ± 22.75 & 88.67 ± 7.38  & \textbf{87.68 ± 7.63}  \\
$\mathcal{M}_{\text{1-1-1}}$                                      & 56.89 ± 22.54 & 82.77 ± 9.04  & 81.83 ± 9.02  & 59.79 ± 21.78 & 80.08 ± 8.85  & 79.05 ± 8.95  \\
$\mathcal{M}_{\text{1-1-2}}$                                     & 53.07 ± 22.02 & 85.24 ± 7.57  & 84.54 ± 7.42  & 56.24 ± 21.20 & 82.44 ± 8.17  & 81.25 ± 8.05  \\
$\mathcal{M}_{\text{1-1-3}}$                                     & 45.88 ± 22.83 & 86.88 ± 8.12  & 86.02 ± 8.29  & 49.76 ± 21.31 & 84.94 ± 8.22  & 83.48 ± 8.57  \\
$\mathcal{M}_{\text{1-1-4}}$                                     & 36.89 ± 22.84 & 89.56 ± 7.13  & 88.55 ± 7.36  & 38.91 ± 20.09 & 88.41 ± 6.55  & 86.79 ± 6.93  \\
$\mathcal{M}_{\text{1-1-5}}$                                     & 46.78 ± 23.06 & 86.62 ± 7.95  & 85.82 ± 7.78  & 52.94 ± 20.68 & 85.06 ± 7.05  & 84.02 ± 7.04  \\
$\mathcal{M}_{\text{1-1-6}}$                                     & 36.67 ± 24.80 & 90.01 ± 7.54  & 89.18 ± 7.80  & 38.56 ± 22.26 & 88.71 ± 7.22  & 87.25 ± 7.91  \\
$\mathcal{M}_{\text{1-1-7}}$                                     & 40.61 ± 28.60 & 87.69 ± 10.22 & 86.75 ± 10.63 & 42.70 ± 27.64 & 86.11 ± 10.70 & 84.58 ± 11.35 \\
$\mathcal{M}_{\text{1-1-8}}$                            & \textbf{33.16 ± 20.29} & \textbf{90.80 ± 6.50}  & \textbf{89.70 ± 6.72}  & \textbf{37.33 ± 20.03} & \textbf{89.20 ± 6.62}  & 87.60 ± 7.05 \\ \bottomrule
\end{tabular}

\end{table*}





% TABLE
\begin{table*}[t!]
\centering
\scriptsize
% \footnotesize
\caption{Detailed results for all methods for FMNIST, SVHN, Places and Textures OoD datasets}
\label{table:rn18_class_cifar10_larex_all_methods_results1}
\begin{tabular}{@{}lrrrrrrrrrrrr@{}}
\toprule
\multicolumn{1}{c}{\multirow{2}{*}{\textbf{Method}}} & \multicolumn{3}{c}{\textbf{Fashion MNIST}}                                                                & \multicolumn{3}{c}{\textbf{SVHN}}                                                                         & \multicolumn{3}{c}{\textbf{Places 365}}                                                                   & \multicolumn{3}{c}{\textbf{Textures}}                                                                     \\ \cmidrule(l){2-13} 
\multicolumn{1}{c}{}                                 & \multicolumn{1}{l}{\textbf{FPR95$\downarrow$}} & \multicolumn{1}{l}{\textbf{AUROC$\uparrow$}} & \multicolumn{1}{l}{\textbf{AUPR$\uparrow$}} & \multicolumn{1}{l}{\textbf{FPR95$\downarrow$}} & \multicolumn{1}{l}{\textbf{AUROC$\uparrow$}} & \multicolumn{1}{l}{\textbf{AUPR$\uparrow$}} & \multicolumn{1}{l}{\textbf{FPR95$\downarrow$}} & \multicolumn{1}{l}{\textbf{AUROC$\uparrow$}} & \multicolumn{1}{l}{\textbf{AUPR$\uparrow$}} & \multicolumn{1}{l}{\textbf{FPR95$\downarrow$}} & \multicolumn{1}{l}{\textbf{AUROC$\uparrow$}} & \multicolumn{1}{l}{\textbf{AUPR$\uparrow$}} \\ \midrule
MSP                                                  & 52.36                            & 86.94                              & 88.11                             & 74.93                            & 81.90                              & 84.26                             & 58.22                            & 85.11                              & 86.21                             & 73.85                            & 77.66                              & 77.97                             \\
Pred. entropy                                        & 36.08                            & 91.77                              & 91.85                             & 77.95                            & 84.57                              & 87.27                             & 54.31                            & 88.89                              & 89.65                             & 76.22                            & 79.56                              & 79.10                             \\
Pred. MI                             & 86.20                            & 78.63                              & 80.36                             & 94.10                            & 69.82                              & 72.71                             & 78.75                            & 79.78                              & 81.48                             & 86.29                            & 73.93                              & 74.98                             \\
Energy                                               & 25.58                            & 94.71                              & 95.11                             & 85.86                            & 76.06                              & 77.42                             & 47.26                            & 90.70                              & 91.10                             & 66.63                            & 82.67                              & 78.11                             \\
ASH                                                  & 40.86                            & 91.33                              & 91.52                             & 90.40                            & 67.00                              & 68.11                             & 68.45                            & 79.58                              & 77.87                             & 73.95                            & 68.85                              & 58.93                             \\
ReAct                                                & 91.48                            & 56.68                              & 54.53                             & 93.49                            & 51.42                              & 50.03                             & 94.62                            & 52.60                              & 52.82                             & 92.75                            & 51.63                              & 47.44                             \\
DICE                                                 & 40.28                            & 88.77                              & 83.91                             & 66.48                            & 71.49                              & 61.74                             & 63.02                            & 80.37                              & 75.93                             & 91.44                            & 72.10                              & 77.02                             \\
DICE+ReAct                                         & 89.96                            & 65.22                              & 66.00                             & 86.90                            & 60.34                              & 58.21                             & 94.98                            & 52.15                              & 52.86                             & 95.34                            & 50.33                              & 46.87                             \\
kNN                                                  & \textbf{21.48}                   & \textbf{95.24}                     & \textbf{95.21}                    & 78.31                            & 78.59                              & 78.40                             & 26.30                            & \textbf{94.85}                     & \textbf{94.73}                    & 25.37                            & 94.64                              & 93.91                             \\
Mahalanobis                                          & 78.40                            & 73.28                              & 74.31                             & 84.27                            & 65.16                              & 63.16                             & 36.38                            & 90.22                              & 88.54                             & 8.65                             & 97.79                              & 96.80                             \\
LaRED                                                & 66.56                            & 79.29                              & 78.35                             & \textbf{18.23}                   & \textbf{95.53}                     & \textbf{94.57}                    & \textbf{22.68}                   & 94.70                              & 94.01                             & \textbf{5.83}                    & \textbf{98.63}                     & \textbf{98.14}                    \\
LaREM                                                & 66.10                            & 79.35                              & 77.76                             & 24.66                            & 93.71                              & 92.27                             & 30.02                            & 92.34                              & 91.02                             & 7.54                             & 98.09                              & 97.24                             \\ \bottomrule
\end{tabular}
\end{table*}




% TABLE
\begin{table*}[t!]
\centering
\scriptsize
\caption{Detailed results for all methods for LSUN-C, LSUN-R, and iSUN OoD datasets}
\label{table:rn18_class_cifar10_larex_all_methods_results2}
\begin{tabular}{@{}lrrrrrrrrr@{}}
\toprule
\multicolumn{1}{c}{\multirow{2}{*}{\textbf{Method}}} & \multicolumn{3}{c}{\textbf{LSUN-C}}                                                                       & \multicolumn{3}{c}{\textbf{LSUN-R}}                                                                       & \multicolumn{3}{c}{\textbf{iSUN}}                                                                         \\ \cmidrule(l){2-10} 
\multicolumn{1}{c}{}                                 & \multicolumn{1}{l}{\textbf{FPR95$\downarrow$}} & \multicolumn{1}{l}{\textbf{AUROC$\uparrow$}} & \multicolumn{1}{l}{\textbf{AUPR$\uparrow$}} & \multicolumn{1}{l}{\textbf{FPR95$\downarrow$}} & \multicolumn{1}{l}{\textbf{AUROC$\uparrow$}} & \multicolumn{1}{l}{\textbf{AUPR$\uparrow$}} & \multicolumn{1}{l}{\textbf{FPR95$\downarrow$}} & \multicolumn{1}{l}{\textbf{AUROC$\uparrow$}} & \multicolumn{1}{l}{\textbf{AUPR$\uparrow$}} \\ \midrule
MSP                                                  & 50.36                            & 87.44                              & 88.67                             & 60.14                            & 86.34                              & 88.13                             & 62.75                            & 85.73                              & 87.85                             \\
Pred. entropy                                        & 39.88                            & 92.89                              & 93.86                             & 43.30                            & 91.29                              & 92.13                             & 48.60                            & 90.14                              & 91.40                             \\
Pred. MI                                   & 92.22                            & 71.09                              & 73.96                             & 70.68                            & 84.82                              & 86.36                             & 72.69                            & 82.99                              & 84.54                             \\
Energy                                               & 27.68                            & 95.56                              & \textbf{96.52}                    & 39.70                            & 92.69                              & 93.46                             & 43.02                            & 92.24                              & 93.23                             \\
ASH                                                  & 39.22                            & 92.87                              & 93.40                             & 66.62                            & 83.50                              & 84.67                             & 68.27                            & 83.03                              & 84.27                             \\
ReAct                                                & 96.08                            & 54.30                              & 55.73                             & 95.04                            & 50.25                              & 50.15                             & 93.82                            & 52.81                              & 52.27                             \\
DICE                                                 & 80.92                            & 70.24                              & 68.97                             & 69.98                            & 72.03                              & 64.88                             & 82.88                            & 64.67                              & 58.27                             \\
DICE+ReAct                                         & 90.24                            & 51.16                              & 51.77                             & 94.02                            & 52.63                              & 52.81                             & 94.76                            & 51.18                              & 51.42                             \\
kNN                                                  & \textbf{21.24}                   & \textbf{95.98}                     & 96.15                    & \textbf{26.38}                   & \textbf{95.14}                     & \textbf{95.50}                    & \textbf{31.25}                   & \textbf{94.17}                     & \textbf{94.57}                    \\
Mahalanobis                                          & 77.18                            & 70.28                              & 67.76                             & 58.94                            & 82.10                              & 81.31                             & 57.66                            & 81.18                              & 79.08                             \\
LaRED                                                & 30.94                            & 91.93                              & 90.63                             & 48.32                   & 86.57                     & 85.06                    & 39.62                   & 89.00                              & 87.19                             \\
LaREM                                                & 30.56                            & 91.94                              & 90.70                             & 56.04                            & 82.85                              & 80.47                             & 46.40                            & 86.14                              & 83.76                             \\ \bottomrule
\end{tabular}

\end{table*}



% FIGURE
\begin{figure*}[pt!]
    \centering
    \begin{subfigure}{0.37\linewidth}
        \includegraphics[width=\linewidth]{Figures/img_class/ROC_curves/roc_textures.png}
        \caption{ROC curve Textures as OoD dataset}
        \label{subfig:roc_textures}
    \end{subfigure}
    \quad
    \quad
    \quad
    \quad
    \begin{subfigure}{0.37\linewidth}
        \includegraphics[width=\linewidth]{Figures/img_class/ROC_curves/roc_svhn.png}
        \caption{ROC curves with SVHN as the OoD dataset}
        \label{subfig:roc_svhn}
    \end{subfigure}
    
    \begin{subfigure}{0.37\linewidth}
        \includegraphics[width=\linewidth]{Figures/img_class/ROC_curves/roc_isun.png}
        \caption{ROC curves with iSUN as the OoD dataset}
        \label{subfig:roc_isun}
    \end{subfigure}
    \quad
    \quad
    \quad
    \quad
    \begin{subfigure}{0.37\linewidth}
        \includegraphics[width=\linewidth]{Figures/img_class/ROC_curves/roc_places.png}
        \caption{ROC curves with Places as the OoD dataset}
        \label{subfig:roc_places}
    \end{subfigure}
    
    \begin{subfigure}{0.37\linewidth}
        \includegraphics[width=\linewidth]{Figures/img_class/ROC_curves/roc_lsun_c.png}
        \caption{ROC curves with LSUN-Crop as the OoD dataset}
        \label{subfig:roc_lsunc}
    \end{subfigure}
    \quad
    \quad
    \quad
    \quad
    \begin{subfigure}{0.37\linewidth}
        \includegraphics[width=\linewidth]{Figures/img_class/ROC_curves/roc_lsun_r.png}
        \caption{ROC curves with LSUN-Resize as the OoD dataset}
        \label{subfig:roc_lsunr}
    \end{subfigure}

    \begin{subfigure}{0.37\linewidth}
        \includegraphics[width=\linewidth]{Figures/img_class/ROC_curves/roc_fmnist.png}%
        \caption{ROC curves with Fashion-MNIST as the OoD dataset}
        \label{subfig:fmnist}
    \end{subfigure}
% \caption{ROC curves for all tested methods across OoD datasets}
\caption{ROC curves for all OoD detection methods and all the OoD evaluation datasets using model $\mathcal{M}_\text{1-1-8}$ and CIFAR10 as InD. \protect\say{mi}: Predictive Mutual Information, \protect\say{pred\_h}: Predictive Entropy, and \protect\say{mdist}: Mahalanobis distance}
\label{fig:roc_curves_per_dataset}
\end{figure*}%
% FIGURE
\begin{figure*}[pt!]
    \centering
    \begin{subfigure}{0.33\linewidth}
        \includegraphics[width=\linewidth]{Figures/img_class/density_plots/textures_lared_scores.png}
        \caption{LaRED score density for Textures dataset}
        \label{subfig:lared_scores_textures}
    \end{subfigure}
    \quad
    \quad
    \quad
    \quad
    \begin{subfigure}{0.33\linewidth}
        \includegraphics[width=\linewidth]{Figures/img_class/density_plots/svhn_lared_scores.png}
        \caption{LaRED score density for SVHN dataset}
        \label{subfig:lared_scores_svhn}
    \end{subfigure}
    
    \begin{subfigure}{0.33\linewidth}
        \includegraphics[width=\linewidth]{Figures/img_class/density_plots/isun_lared_scores.png}
        \caption{LaRED score density for iSUN dataset}
        \label{subfig:lared_scores_isun}
    \end{subfigure}
    \quad
    \quad
    \quad
    \quad
    \begin{subfigure}{0.33\linewidth}
        \includegraphics[width=\linewidth]{Figures/img_class/density_plots/places_lared_scores.png}
        \caption{LaRED score density for Places dataset}
        \label{subfig:lared_scores_places}
    \end{subfigure}

    \begin{subfigure}{0.33\linewidth}
        \includegraphics[width=\linewidth]{Figures/img_class/density_plots/lsun_c_lared_scores.png}
        \caption{LaRED score density for LSUN-C dataset}
        \label{subfig:lared_score_lsun-c}
    \end{subfigure}
    \quad
    \quad
    \quad
    \quad
    \begin{subfigure}{0.33\linewidth}
        \includegraphics[width=\linewidth]{Figures/img_class/density_plots/lsun_r_lared_scores.png}
        \caption{LaRED score density for LSUN-R dataset}
        \label{subfig:lared_score_lsun-r}
    \end{subfigure}

    \begin{subfigure}{0.33\linewidth}
        % \centering
        \includegraphics[width=\linewidth]{Figures/img_class/density_plots/fmnist_lared_scores.png}%
        \caption{LaRED score density for FMNIST dataset}
        \label{subfig:lared_score_fmnist}
    \end{subfigure}
    
\caption{LaRED score densities for model $\mathcal{M}_\text{1-1-8}$ across all OoD evaluation datasets}
\label{fig:lared_density_scores_all_ood_datasets}
\end{figure*}%





%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%						SUB-SECTION							%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\subsection{Training regularization impacts post-hoc detection methods}

In addition to the previous results, we provide evidence that the performance metrics of the post-hoc methods are influenced by the regularization applied to the model during training. We statistically tested the hypothesis that the AUROC, the AUPR, and the FPR@95, pooled over all baselines and \texttt{LaREx} and across OoD datasets, differ between two models: one regularized with data augmentations, dropout, and spectral normalization, and one that uses none of these regularization techniques. These models correspond to $\mathcal{M}_{\text{1-1-8}}$ (DA+DO+SN) and $\mathcal{M}_{\text{1-1-1}}$ (No reg.), respectively, of \Cref{table:rn18_class_cifar10_larex_all_models} and \Cref{fig:DNN_regularization_impact_larex}. Note that training was in no way carried out with the goal of favoring any OoD detection method in particular. Indeed, all of them are post-hoc methods, so none requires specific training. With 12 OoD detection methods (all 10 baselines plus \texttt{LaRED} and \texttt{LaREM}) and 7 OoD datasets, we obtain a sample of 84 data points per metric and model.
We first checked normality with the Shapiro-Wilk test; the results in \Cref{table:rn18_class_cifar10_shapiro_test} show that the samples are unlikely to follow a normal distribution. Therefore, we applied a non-parametric test, the Mann-Whitney U test, whose results are reported in \Cref{table:rn18_class_cifar10_mann_whitney_test}.
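
For reference, the test can be sketched as follows. This is an illustration only: it assumes the standard tie-free, large-sample form of the Mann-Whitney U test, and the symbols $n_1$, $n_2$, $R_1$, $U_1$, and $z$ are notation introduced here. Each model contributes $n_1 = n_2 = 12 \times 7 = 84$ observations per metric. With $R_1$ the sum of the ranks of the first model's observations in the pooled sample,
\begin{align*}
U_1 &= R_1 - \frac{n_1 (n_1 + 1)}{2}, \\
z &= \frac{U_1 - n_1 n_2 / 2}{\sqrt{n_1 n_2 (n_1 + n_2 + 1) / 12}} \\
  &= \frac{U_1 - 3528}{\sqrt{99372}} \approx \frac{U_1 - 3528}{315.2},
\end{align*}
and the two-sided p-value is $2\,(1 - \Phi(|z|))$, where $\Phi$ is the standard normal cumulative distribution function. The signed statistics reported in \Cref{table:rn18_class_cifar10_mann_whitney_test} appear consistent with such a standardized form, although this reading is an assumption.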

\begin{table}[pt!]
\centering
% \tiny
\scriptsize
\caption{Shapiro-Wilk normality test results for OoD detection metrics for two models across baselines and across datasets}
\label{table:rn18_class_cifar10_shapiro_test}
\begin{tabular}{@{}llrrrrr@{}}
\toprule
Model                   & Metric & \multicolumn{1}{l}{Statistic} & \multicolumn{1}{l}{p} & \multicolumn{1}{l}{Mean} & \multicolumn{1}{l}{Median} & \multicolumn{1}{l}{Std.} \\ \midrule
\multirow{3}{*}{$\mathcal{M}_{\text{1-1-8}}$} & AUPR   & 0.894                         & $4.48\times10^{-6}$              & 79.03                    & 83.83                      & 14.96                         \\
                        & FPR@95 & 0.9373                        & $4.94\times10^{-4}$              & 60.68                    & 66.29                      & 26.00                            \\
                        & AUROC  & 0.8976                        & $6.16\times10^{-6}$              & 79.56                    & 82.92                      & 14.37                         \\ \midrule
\multirow{3}{*}{$\mathcal{M}_{\text{1-1-1}}$} & AUPR   & 0.9036                        & $1.13\times10^{-5}$              & 78.39                    & 81.71                      & 10.65                         \\
                        & FPR@95 & 0.8936                        & $4.18\times10^{-6}$              & 72.38                    & 75.11                      & 17.62                         \\
                        & AUROC  & 0.9542                        & $4.63\times10^{-3}$              & 77.91                    & 79.98                      & 9.53                          \\ \bottomrule
\end{tabular}


\end{table}



\begin{table}[pt!]
\centering
\footnotesize
\caption{Mann-Whitney U test results for OoD detection metrics for two models across baselines and across datasets}
\label{table:rn18_class_cifar10_mann_whitney_test}
\begin{tabular}{@{}lrr@{}}
\toprule
Metric & \multicolumn{1}{l}{Statistic} & \multicolumn{1}{l}{p} \\ \midrule
AUPR   & 1.227                         & 0.2195                \\
AUROC  & 1.836                         & 0.0662                \\
FPR@95 & -2.636                        & 0.00838          \\ \bottomrule    
\end{tabular}
\end{table}

The results in \Cref{table:rn18_class_cifar10_mann_whitney_test} show a statistically significant difference between the two models in FPR@95 and a near-significant difference in AUROC, whereas the AUPR does not differ significantly. Together with the descriptive statistics in \Cref{table:rn18_class_cifar10_shapiro_test}, where the regularized model has the lower mean and median FPR@95, this indicates that regularization during training has a statistically significant impact on the tested OoD detection methods taken as a whole, and that it tends to improve their performance. These results are nevertheless preliminary, since they are not the main goal of our study, and further research is needed on how training procedures, regularization, and InD dataset characteristics affect OoD detection methods. We hypothesize that all of these factors have a general influence on the OoD detection task.
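
To make the size and direction of the effect explicit, the differences in means between $\mathcal{M}_{\text{1-1-8}}$ and $\mathcal{M}_{\text{1-1-1}}$ can be read off \Cref{table:rn18_class_cifar10_shapiro_test} and set against the p-values of \Cref{table:rn18_class_cifar10_mann_whitney_test}. The significance level $\alpha = 0.05$ used below is an assumption made for illustration, and $\Delta$ is notation introduced here:
\begin{align*}
\Delta_{\text{FPR@95}} &= 60.68 - 72.38 = -11.70, & p &= 0.00838 < \alpha, \\
\Delta_{\text{AUROC}}  &= 79.56 - 77.91 = +1.65,  & p &= 0.0662 > \alpha, \\
\Delta_{\text{AUPR}}   &= 79.03 - 78.39 = +0.64,  & p &= 0.2195 > \alpha.
\end{align*}
All three differences point toward better detection for the regularized model (lower FPR@95, higher AUROC and AUPR), but only the FPR@95 difference is significant at this level.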



\newpage

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%							SECTION							%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Object Detection Experiments}
\label{append:obj_detect_exp_detail}


%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%						SUB-SECTION							%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\subsection{DNN Training Details}

For the object detection experiments, we took as a starting point the Faster-RCNN model of \citet{du2022vos}, already trained on BDD100K with the \textit{Detectron2} \citep{Detectron2018} library and openly available in their repository. Since neither the VOS paper nor its code makes clear whether the models were trained with Dropout, and they were clearly not trained with DropBlock, we ran experiments in which we fine-tuned the pre-trained models.
To fine-tune the RPN and the Box Head, we froze the backbone and unfroze the RPN and all subsequent layers, using a learning rate of $1\times 10^{-4}$ and keeping the remaining original hyper-parameters of the model (the original learning rate was $2\times 10^{-2}$). The DropBlock block size was 4, with a drop probability of 0.5.
To fine-tune the Box Head only, we froze the backbone and the RPN and trained the remaining layers with the same learning rate as described for the RPN. The Dropout layer used $p=0.5$. All fine-tuning runs lasted 10 epochs. We also tested the method by simply adding the Dropout or DropBlock layer to the pre-trained network without fine-tuning; the results are shown in \Cref{table:obj_det_fine_tune_results}. We extracted 16 zMCD samples for all runs.


%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%						SUB-SECTION							%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\subsection{Dimensionality Reduction}

For object detection, the dimensionality of the extracted samples was high (a vector with 1280 components for the RPN samples and 1000 for the Box Head samples), so we performed PCA before feeding the data to the final Kernel Density Estimator (KDE) or the Mahalanobis distance estimator.
We used a randomly chosen sample of 8000 images from the training data and extracted 16 zMCD samples per image, which yields a tensor of dimension $(8000\times16, Z_s)$, where $Z_s$, the latent space size, was either 1280 for the RPN or 1000 for the Box Head. The entropy was then computed over the 16 zMCD samples of each image, which gives a tensor of size $(8000, Z_s)$.
We evaluated the performance for several numbers of PCA components: $\{1, 6, 14, 20, 24, 32, 40, 48, 56, 64, 72, 80\}$. The results of the PCA evaluation for the RPN hook models can be seen in \Cref{fig:larex_pca_evaluation}. For \texttt{LaRED}, performance peaks at 40 components, whereas for \texttt{LaREM}, the more components, the better.
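
The shapes involved in this pipeline can be sketched as follows; the symbols $k$, $\mathbf{h}$, $\boldsymbol{\mu}$, and $\mathbf{W}_k$ are notation introduced here for illustration only:
\begin{align*}
\text{zMCD samples:} &\quad (8000 \times 16,\, Z_s) = (128000,\, Z_s), \\
\text{entropy per image:} &\quad (8000,\, Z_s), \\
\text{after PCA:} &\quad (8000,\, k), \quad k \in \{1, 6, 14, \dots, 80\}.
\end{align*}
Assuming standard PCA, an entropy vector $\mathbf{h}$ of size $Z_s$ is projected as
\begin{equation*}
\tilde{\mathbf{h}} = \mathbf{W}_k^{\top} (\mathbf{h} - \boldsymbol{\mu}),
\end{equation*}
where $\boldsymbol{\mu}$ is the mean entropy vector of the 8000 training images and the columns of $\mathbf{W}_k$ (of size $Z_s \times k$) are the $k$ leading principal directions. For the RPN ($Z_s = 1280$), $k = 40$ and $k = 80$ correspond to 32-fold and 16-fold reductions, respectively.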
 



%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%						SUB-SECTION							%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\subsection{Detailed Results}
The detailed results of the experiments on the number of PCA components are given in \Cref{table:obj_det_pca_comps_results}. For all experiments, the random generators were seeded with the number 42. Based on the PCA results from \Cref{fig:larex_pca_evaluation}, we chose 40 components for \texttt{LaRED} and 80 for \texttt{LaREM}. Furthermore, to visualize the separation achieved by our method, we present a 2D PaCMAP~\citep{JMLR:v22:20-1061} projection of the entropy vectors in \Cref{fig:pacmap_coco_openimages}.

\Cref{table:obj_det_fine_tune_results} presents the results for fine-tuned vs not fine-tuned models. The fine-tuned models perform better and, in agreement with the image classification experiments, the results are better for the RPN, which is a more intermediate layer than the Box Head.
% not as close to the output as the Box Heads.
Furthermore, \Cref{table:obj_det_zmcd_samples_results} presents the impact of the number of zMCD samples on the performance metrics. In general, we observe that the more samples are taken, the better. However, even with 5 zMCD samples, the performance drop is not extreme, which shows the robustness of the presented method and suggests that a trade-off with fewer zMCD samples can be found; the need to draw multiple samples is indeed one of the main limitations of our approach. Finally, \Cref{fig:obj_dtct_roc_curves_ood} visualizes the performance of the detection methods in terms of ROC curves. We observe that most methods achieve high performance, which validates the separability of the two classes (InD and OoD) depicted previously in \Cref{fig:pacmap_coco_openimages}.



\begin{figure*}[pt!]
    \centering
    \includegraphics[width=0.6\linewidth]{Figures/Object_detection/pca_auroc_larex.png}
    \caption{\texttt{LaRED} \& \texttt{LaREM}: AUROC for different numbers of PCA components}
    \label{fig:larex_pca_evaluation}
\end{figure*} 

\begin{figure*}[pt!]
    \centering
    \begin{subfigure}{0.45\linewidth}
        \includegraphics[width=\linewidth]{Figures/Object_detection/h_z_pacmap_coco.png}
        \caption{BDD100K (InD) vs COCO (OoD)}
        \label{subfig:bdd_vs_coco}
    \end{subfigure}
    \quad
    \quad
    \begin{subfigure}{0.45\linewidth}
        \includegraphics[width=\linewidth]{Figures/Object_detection/h_z_pacmap_openimages.png}
        \caption{BDD100K (InD) vs OpenImages (OoD)}
        \label{subfig:bdd_vs_openimages}
    \end{subfigure}
    % \caption{Entropy vectors PacMAP 2D projection for InD vs OoD datasets}
    % \caption{Object Detection InD vs OoD datasets entropy vectors PacMAP 2D projection}
    \caption{Faster R-CNN with BDD100K as InD: 2D PaCMAP projection of the entropy vectors for InD vs OoD datasets}
    \label{fig:pacmap_coco_openimages}
\end{figure*}

\begin{figure*}[pt!]
    \centering
    \begin{subfigure}{0.45\linewidth}
        \includegraphics[width=\linewidth]{Figures/Object_detection/roc_coco.png}
        \caption{ROC curves with COCO as the OoD dataset}
        \label{subfig:roc_coco}
    \end{subfigure}
    \quad
    \quad
    \begin{subfigure}{0.45\linewidth}
        \includegraphics[width=\linewidth]{Figures/Object_detection/roc_openimages.png}
        \caption{ROC curves with OpenImages as the OoD dataset}
        \label{subfig:roc_openimages}
    \end{subfigure}
    
    \caption{Object detection with Faster-RCNN: ROC curves for all OoD detection methods with COCO and OpenImages as OoD datasets, fine-tuned model, LaRED RPN. \protect\say{mi}: Predictive Mutual Information, \protect\say{pred\_h}: Predictive Entropy}
    \label{fig:obj_dtct_roc_curves_ood}
\end{figure*}






\begin{table*}[pt!]
\centering
\footnotesize
\caption{Detailed \texttt{LaREx} results by number of PCA components for object detection with BDD100K as InD, fine-tuned model}
\label{table:obj_det_pca_comps_results}
\begin{tabular}{@{}cccccccc@{}}
\toprule
\multicolumn{1}{l}{Model}                                             & \begin{tabular}[c]{@{}c@{}}PCA \\ components\end{tabular} & \begin{tabular}[c]{@{}c@{}}AUPR \\ COCO\end{tabular} & \begin{tabular}[c]{@{}c@{}}AUPR \\ OpenImages\end{tabular} & \begin{tabular}[c]{@{}c@{}}AUROC\\ COCO\end{tabular} & \begin{tabular}[c]{@{}c@{}}AUROC \\ OpenImages\end{tabular} & \begin{tabular}[c]{@{}c@{}}FPR95\\ COCO\end{tabular} & \begin{tabular}[c]{@{}c@{}}FPR95 \\ OpenImages\end{tabular} \\ \midrule
\multirow{12}{*}{\begin{tabular}[c]{@{}c@{}}LaRED\\ RPN\end{tabular}} & 1                                                         & 63.7                                                 & 55.05                                              & 64.57                                                & 53.99                                               & 93.03                                                & 93.41                                               \\
                                                                      & 6                                                         & 98.2                                                 & 99.23                                              & 98.06                                                & 99.07                                               & 8.98                                                 & 3.8                                                 \\
                                                                      & 14                                                        & 99.34                                                & 99.56                                              & 99.21                                                & 99.44                                               & 3.35                                                 & 1.98                                                \\
                                                                      & 20                                                        & 99.26                                                & 99.41                                              & 98.08                                                & 99.26                                               & 3.85                                                 & 2.72                                                \\
                                                                      & 24                                                        & 99.51                                                & 99.58                                              & 99.43                                                & 99.5                                                & 2.5                                                  & 1.93                                                \\
                                                                      & 32                                                        & 99.78                                                & 99.89                                              & 99.75                                                & 99.88                                               & 1.01                                                 & 0.45                                                \\
                                                                      & 40                                                        & \textbf{99.84}                                       & \textbf{99.89}                                     & \textbf{99.81}                                       & \textbf{99.87}                                      & \textbf{0.32}                                        & \textbf{0.22}                                       \\
                                                                      & 48                                                        & 99.26                                                & 99.36                                              & 99.01                                                & 99.1                                                & 1.49                                                 & 1.02                                                \\
                                                                      & 56                                                        & 96.36                                                & 96.8                                               & 95.11                                                & 95.55                                               & 42.76                                                & 38.78                                               \\
                                                                      & 64                                                        & 87.43                                                & 89                                                 & 84.82                                                & 85.89                                               & 49.57                                                & 50.48                                               \\
                                                                      & 72                                                        & 77.22                                                & 79.86                                              & 75.16                                                & 76.46                                               & 48.88                                                & 48.89                                               \\
                                                                      & 80                                                        & 73.34                                                & 77.25                                              & 71.14                                                & 73.93                                               & 50.9                                                 & 48.32                                               \\ \midrule
\multirow{12}{*}{\begin{tabular}[c]{@{}c@{}}LaREM\\ RPN\end{tabular}} & 1                                                         & 60.62                                                & 53.95                                              & 60.88                                                & 51.55                                               & 93.29                                                & 93.01                                               \\
                                                                      & 6                                                         & 96.89                                                & 98.51                                              & 96.72                                                & 98.26                                               & 16.38                                                & 9.02                                                \\
                                                                      & 14                                                        & 98.27                                                & 98.75                                              & 97.9                                                 & 98.41                                               & 10.15                                                & 7.55                                                \\
                                                                      & 20                                                        & 98.11                                                & 98.7                                               & 97.66                                                & 98.32                                               & 11.91                                                & 8.8                                                 \\
                                                                      & 24                                                        & 99.36                                                & 99.62                                              & 99.22                                                & 99.52                                               & 3.29                                                 & 1.53                                                \\
                                                                      & 32                                                        & 99.64                                                & 99.8                                               & 99.59                                                & 99.76                                               & 1.64                                                 & 0.85                                                \\
                                                                      & 40                                                        & 99.7                                                 & 99.85                                              & 99.66                                                & 99.82                                               & 1.48                                                 & 0.68                                                \\
                                                                      & 48                                                        & 99.73                                                & 99.87                                              & 99.69                                                & 99.84                                               & 1.11                                                 & 0.34                                                \\
                                                                      & 56                                                        & 99.79                                                & 99.91                                              & 99.77                                                & 99.89                                               & 0.74                                                 & 0.17                                                \\
                                                                      & 64                                                        & 99.78                                                & 99.92                                              & 99.75                                                & 99.9                                                & 0.85                                                 & 0.11                                                \\
                                                                      & 72                                                        & 99.78                                                & 99.92                                              & 99.75                                                & 99.9                                                & 0.9                                                  & 0.056                                               \\
                                                                      & 80                                                        & \textbf{99.8}                                        & \textbf{99.93}                                     & \textbf{99.76}                                       & \textbf{99.91}                                      & \textbf{0.79}                                        & \textbf{0.012}                                      \\ \bottomrule
\end{tabular}
\end{table*}




\begin{table*}[pt!]
\centering
\footnotesize
\caption{Detailed \texttt{LaREx} results for fine-tuned vs not fine-tuned models for object detection with BDD100K as InD}
\label{table:obj_det_fine_tune_results}
\begin{tabular}{lllllc}
\hline
\multicolumn{1}{c}{\multirow{2}{*}{\textbf{Methods}}}             & \multicolumn{2}{c}{\textbf{OoD: COCO}}                                  & \multicolumn{2}{c}{\textbf{OoD: OpenImages}}                          & \multirow{2}{*}{\textbf{mAP}} \\ \cline{2-5}
\multicolumn{1}{c}{}                                              & \multicolumn{1}{c}{\textbf{FPR95}} & \multicolumn{1}{c}{\textbf{AUROC}} & \multicolumn{1}{c}{\textbf{FPR95}} & \multicolumn{1}{c}{\textbf{AUROC}} &                               \\ \hline
\begin{tabular}[c]{@{}l@{}}LaRED RPN \\ Fine-tuned\end{tabular}   & \textbf{0.31±0.3}                  & \textbf{99.81±0.4}                 & 0.22±0.21                          & 99.88±0.6                          & 28.0                          \\
\begin{tabular}[c]{@{}l@{}}LaREM RPN \\ Fine-tuned\end{tabular}   & 0.79±0.58                          & 99.77±0.26                         & \textbf{0.11±0.09}                 & \textbf{99.91±0.08}                & 28.0                          \\
\begin{tabular}[c]{@{}l@{}}LaRED RPN \\ No Fine-tune\end{tabular} & 0.90±0.8                           & 99.79±0.3                          & 0.22±0.4                           & 99.89±0.7                          & 31.21                         \\
\begin{tabular}[c]{@{}l@{}}LaRED FC \\ Fine-tuned\end{tabular}    & 12.07±0.6                          & 97.48±0.8                          & 10.33±1.2                          & 97.54±0.9                          & 29.6                          \\
\begin{tabular}[c]{@{}l@{}}LaRED FC\\ No Fine-tune\end{tabular}   & 42.12±0.8                          & 90.50±0.7                          & 31.51±0.7                          & 93.20±0.5                          & 31.21                         \\ \hline
\end{tabular}
\end{table*}




\begin{table*}[pt!]
\centering
\footnotesize
% \caption{Detailed results for the number of zMCD samples to take LaRED RPN Object detection InD BDD100k, fine-tuned model}
\caption{Detailed \texttt{LaRED} RPN results by number of zMCD samples for object detection with BDD100K as InD, fine-tuned model}
\label{table:obj_det_zmcd_samples_results}
\begin{tabular}{@{}ccccccc@{}}
\toprule
\textbf{zMCD} & \textbf{\begin{tabular}[c]{@{}c@{}}AUPR\\ COCO\end{tabular}} & \textbf{\begin{tabular}[c]{@{}c@{}}AUPR\\ OpenImages\end{tabular}} & \textbf{\begin{tabular}[c]{@{}c@{}}AUROC \\ COCO\end{tabular}} & \textbf{\begin{tabular}[c]{@{}c@{}}AUROC \\ OpenImages\end{tabular}} & \textbf{\begin{tabular}[c]{@{}c@{}}FPR95\\ COCO\end{tabular}} & \textbf{\begin{tabular}[c]{@{}c@{}}FPR95\\ OpenImages\end{tabular}} \\ \midrule
20            & \textbf{99.84}                                               & \textbf{99.89}                                                     & \textbf{99.81}                                                 & \textbf{99.87}                                                       & \textbf{0.31}                                                 & \textbf{0.22}                                                       \\
16            & 99.71                                                        & 99.87                                                              & 99.67                                                          & 99.83                                                                & 1.48                                                          & 0.34                                                                \\
10            & 99.19                                                        & 99.57                                                              & 99.07                                                          & 99.46                                                                & 4.25                                                          & 2.04                                                                \\
5             & 96.45                                                        & 97.41                                                              & 96.1                                                           & 97.02                                                                & 20.58                                                         & 16.41                                                               \\ \bottomrule
\end{tabular}
\end{table*}






\newpage

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%							SECTION							%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Semantic Segmentation Experiments}
\label{append:sem_seg_exp_detail}
% \ToDo{Add Text}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%   					SUB-SECTION							%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\subsection{DNN Training Details}
\label{append:sem_seg_dnn_training_details}
For the semantic segmentation task, we consider the DeepLabv3+ \citep{chen2018encoder} and U-Net \citep{ronneberger2015u} architectures, and the Cityscapes and Woodscape datasets for training (InD). In the DeepLabv3+ architecture\footnote{\url{https://github.com/VainF/DeepLabV3Plus-Pytorch}}, we added a DropBlock layer at the output of the ResNet encoder, using a block size of $8 \times 8$ and drop probability $p=0.5$, to take zMCD samples. The encoder output is a tensor of shape $W/16 \times H/16 \times 2048$, where $W$ and $H$ represent the input image width and height, respectively, and the last dimension corresponds to the number of channels. For the U-Net architecture, we likewise place a DropBlock layer at the output of the encoder, with the same block size of $8 \times 8$ and drop probability $p=0.5$, to take zMCD samples. The encoder output has 128 channels for the U-Net DNN trained with the Woodscape dataset and 256 channels for the U-Net DNN trained with the Cityscapes dataset. \Cref{table:sem_seg_dnn_hyperparameteres} summarizes the DNN training hyperparameters used for each architecture and dataset.
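
As an illustrative worked example, assuming that the image size listed in \Cref{table:sem_seg_dnn_hyperparameteres} is given as $W \times H$ (an assumption, since the ordering is not stated explicitly), the DeepLabv3+ encoder output for a $512 \times 256$ Cityscapes input would have shape
\begin{equation*}
\frac{W}{16} \times \frac{H}{16} \times 2048 = \frac{512}{16} \times \frac{256}{16} \times 2048 = 32 \times 16 \times 2048 .
\end{equation*}
Under this assumption, a single $8 \times 8$ DropBlock region covers at most $64$ of the $32 \cdot 16 = 512$ spatial positions in each channel.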


%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%   					SUB-SECTION							%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\subsection{Evaluation Datasets}
\label{append:sem_seg_dist_shift_datasets}

As mentioned in \Cref{subsec:sem_seg}, we consider data with covariate shift for the semantic segmentation experiments. We used the Albumentations\footnote{\url{https://albumentations.ai/}} library to create a \textit{synthetic anomalies} version of the InD datasets. For the synthetic anomalies, we used the \textit{Random Fog} and \textit{Random Sun Flare} transforms, and we implemented a custom transform to add the \textit{Mud on lens} effect. \Cref{fig:sem_seg_dnn_ind_datasets} shows samples from the (InD) training sets, while \Cref{fig:sem_seg_dnn_cs_anomal_dataset,fig:sem_seg_dnn_ws_anomal_dataset,fig:sem_seg_dnn_ws_soil_dataset} show samples from the covariate-shift datasets used for evaluation.



%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%							SECTION							%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

\subsection{Semantic Segmentation Detailed Results}
\label{append:sem_seg_exp_detailed_tables}
% For Deeplabv3+, 
% Different from the other tasks, in semantic segmentation,
We use all the training dataset samples to set up the methods and compute the InD scores for \texttt{LaREx} and the implemented baselines. The evaluation uses all the samples from the validation and test sets. \Cref{table:cs_dplbv3p_semseg_larem_lared_results,table:ws_dplbv3p_semseg_larem_lared_results} present the detailed performance results for each evaluated distribution shift dataset with DeepLabv3+, and \Cref{table:semseg_unet_lared_results} presents the corresponding results for U-Net. The \texttt{LaREx} performance difference can be attributed to a sub-optimal selection of the parameters and to the presence of \say{clean} InD images in the evaluation datasets. In contrast to the other experiments, the Mahalanobis distance achieves the best performance across the evaluated datasets for both DeepLabv3+ models. This was also the case for \texttt{LaREM} when compared with \texttt{LaRED}. We attribute the dominance of the Mahalanobis-based methods to the dimensionality of the entropy vectors: no dimensionality reduction (w/PCA) is performed, which might otherwise suppress information that is useful for detection. The 2D projections of the entropy vectors using PaCMAP \citep{JMLR:v22:20-1061} are displayed in \Cref{fig:sem_seg_dplbv3p_cs_2d_proj,fig:sem_seg_dplbv3p_ws_2d_proj} for DeepLabv3+, and in \Cref{fig:sem_seg_unet_cs_2d_proj,fig:sem_seg_unet_ws_2d_proj} for U-Net, validating the effectiveness of the entropy vectors for the distribution shift detection task and supporting our analysis of the results. Moreover, \Cref{fig:sem_seg_dplbv3p_cs_lared_score_comp,fig:sem_seg_dplbv3p_cs_larem_score_comp,fig:sem_seg_dplbv3p_ws_lared_score_comp,fig:sem_seg_dplbv3p_ws_larem_score_comp} show the \texttt{LaREx} score comparison for each evaluated dataset with DeepLabv3+, and \Cref{fig:sem_seg_unet_cs_lared_score_comp,fig:sem_seg_unet_ws_lared_score_comp} show the \texttt{LaRED} score comparison for each evaluated dataset with U-Net.
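
For reference, the Mahalanobis-based scores discussed above build on the standard Mahalanobis distance. The following is a sketch in notation introduced only for this remark: the symbols $\mathbf{z}$, $\boldsymbol{\mu}$, $\boldsymbol{\Sigma}$, and $d$ are assumptions and may differ from the main text, and class-conditional variants also exist. The distance of a $d$-dimensional vector $\mathbf{z}$ to an InD reference distribution with mean $\boldsymbol{\mu}$ and covariance $\boldsymbol{\Sigma}$, both estimated from the training samples, is
\begin{equation*}
d_{M}(\mathbf{z}) = \sqrt{(\mathbf{z} - \boldsymbol{\mu})^{\top} \boldsymbol{\Sigma}^{-1} (\mathbf{z} - \boldsymbol{\mu})} .
\end{equation*}
Without PCA, $d$ is the full dimensionality of the vector being scored (presumably $d = 2048$ for DeepLabv3+, as the method name \texttt{LaREM}-2048 suggests), so every dimension contributes to the score.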





\newpage
\begin{table*}[t!]
\centering
\scriptsize
% \footnotesize
\caption{Semantic Segmentation DNN training details}
\label{table:sem_seg_dnn_hyperparameteres}
\begin{tabular}{@{}lrrrr@{}}
\toprule
\multicolumn{1}{c}{\multirow{2}{*}{\textbf{Parameter}}}             & \multicolumn{2}{c}{\textbf{DeepLabv3+}} & \multicolumn{2}{c}{\textbf{U-Net}}      \\ \cmidrule(l){2-5} 
\multicolumn{1}{c}{} &
  \multicolumn{1}{c}{\textbf{Cityscapes}} &
  \multicolumn{1}{c}{\textbf{Woodscape}} &
  \multicolumn{1}{c}{\textbf{Cityscapes}} &
  \multicolumn{1}{c}{\textbf{Woodscape}} \\ \midrule
Image size                                                          & $512\times256$     & $640\times483$     & $128\times256$     & $128\times256$     \\
Epochs                                                              & 1500               & 350                & 1800               & 1400               \\
Batch size                                                          & 16                 & 8                  & 16                 & 16                 \\
Loss                                                                & Focal              & Focal              & CE                 & CE                 \\
Optimizer                                                           & SGD                & SGD                & Adam               & Adam               \\
Weight decay                                                        & $5\times10^{-4}$   & $5\times10^{-4}$   & -                  & -                  \\
LR scheduler &
  \begin{tabular}[c]{@{}r@{}}Cosine\\ annealing\end{tabular} &
  \begin{tabular}[c]{@{}r@{}}Cosine\\ annealing\end{tabular} &
  \begin{tabular}[c]{@{}r@{}}Cosine\\ annealing\end{tabular} &
  \begin{tabular}[c]{@{}r@{}}Cosine\\ annealing\end{tabular} \\
\begin{tabular}[c]{@{}l@{}}LR scheduler\\ $\eta_{min}$\end{tabular} & $1\times10^{-3}$   & $1\times10^{-3}$   & $2.3\times10^{-5}$ & $2.3\times10^{-5}$ \\ \bottomrule
\end{tabular}
\end{table*}
\begin{figure*}[t!]
    \centering
    \begin{subfigure}{0.25\linewidth}
        \includegraphics[width=\linewidth]{Figures/sem_seg/cityscapes_sample_2.png}
        \caption{Cityscapes sample}
        \label{subfig:citscapes_sample}
    \end{subfigure}
    % \hfil
    \begin{subfigure}{0.25\linewidth}
        \includegraphics[width=\linewidth]{Figures/sem_seg/woodscape_sample.png}
        \caption{Woodscape sample}
        \label{subfig:woodscape_sample}
    \end{subfigure}
\caption{Semantic segmentation DNN InD (training) dataset samples}
\label{fig:sem_seg_dnn_ind_datasets}
\end{figure*}
\begin{figure*}[t!]
    \centering
    \begin{subfigure}{0.3\linewidth}
        \includegraphics[width=\linewidth]{Figures/sem_seg/cityscapes_anomalies_flare_sample_3.png}
        \caption{Flare sample}
        \label{subfig:sem_seg_cs_anomal_flare}
    \end{subfigure}
    % \hfil
    \begin{subfigure}{0.3\linewidth}
        \includegraphics[width=\linewidth]{Figures/sem_seg/cityscapes_anomalies_fog_sample_1.png}
        \caption{Fog sample}
        \label{subfig:sem_seg_cs_anomal_fog}
    \end{subfigure}
    % \hfil
    \begin{subfigure}{0.3\linewidth}
        \includegraphics[width=\linewidth]{Figures/sem_seg/cityscapes_anomalies_mud_sample_1.png}
        \caption{Mud-on-lens sample}
        \label{subfig:sem_seg_cs_anomal_mud}
    \end{subfigure}
\caption{Cityscapes-Anomalies dataset samples}
\label{fig:sem_seg_dnn_cs_anomal_dataset}
\end{figure*}
\begin{figure*}[t!]
    \centering
    \begin{subfigure}{0.3\linewidth}
        \includegraphics[width=\linewidth]{Figures/sem_seg/woodscape_anomalies_sample_2.png}
        \caption{Flare sample}
        \label{subfig:sem_seg_ws_anomal_flare}
    \end{subfigure}
    % \hfil
    \begin{subfigure}{0.3\linewidth}
        \includegraphics[width=\linewidth]{Figures/sem_seg/woodscape_anomalies_sample_1.png}
        \caption{Fog sample}
        \label{subfig:sem_seg_ws_anomal_fog}
    \end{subfigure}
    % \hfil
    \begin{subfigure}{0.3\linewidth}
        \includegraphics[width=\linewidth]{Figures/sem_seg/woodscape_anomalies_sample_3.png}
        \caption{Mud-on-lens sample}
        \label{subfig:sem_seg_ws_anomal_mud}
    \end{subfigure}
\caption{Woodscape-Anomalies dataset samples}
\label{fig:sem_seg_dnn_ws_anomal_dataset}
\end{figure*}
\begin{figure*}[t!]
    \centering
    \begin{subfigure}{0.3\linewidth}
        \includegraphics[width=\linewidth]{Figures/sem_seg/woodscape_soiling_sample_3.png}
        \caption{Mud-on-lens sample}
        \label{subfig:sem_seg_ws_soil_mud_on_lens}
    \end{subfigure}
    % \hfil
    \begin{subfigure}{0.3\linewidth}
        \includegraphics[width=\linewidth]{Figures/sem_seg/woodscape_soiling_sample_4.png}
        \caption{Drops-on-lens sample}
        \label{subfig:sem_seg_ws_soil_water_drops}
    \end{subfigure}
    % \hfil
    \begin{subfigure}{0.3\linewidth}
        \includegraphics[width=\linewidth]{Figures/sem_seg/woodscape_soiling_sample_6.png}
        \caption{Dust-on-lens sample}
        \label{subfig:sem_seg_ws_soil_fog}
    \end{subfigure}
\caption{Woodscape-Soiling dataset samples}
\label{fig:sem_seg_dnn_ws_soil_dataset}
\end{figure*}



\newpage

\begin{table*}[pt!]
\centering
% \footnotesize
\scriptsize
\begin{tabular}{@{}lrrrrrrrrr@{}}
\toprule
\multicolumn{1}{c}{\multirow{2}{*}{\textbf{Methods}}} &
  \multicolumn{3}{c}{\textbf{Cityscapes-Anomalies}} &
  \multicolumn{3}{c}{\textbf{Woodscape}} &
  \multicolumn{3}{c}{\textbf{Woodscape-Soiling}} \\ \cmidrule(l){2-10} 
\multicolumn{1}{c}{} &
  \multicolumn{1}{l}{\textbf{FPR95 $\downarrow$}} &
  \multicolumn{1}{l}{\textbf{AUROC $\uparrow$}} &
  \multicolumn{1}{l|}{\textbf{AUPR $\uparrow$}} &
  \multicolumn{1}{l}{\textbf{FPR95 $\downarrow$}} &
  \multicolumn{1}{l}{\textbf{AUROC $\uparrow$}} &
  \multicolumn{1}{l|}{\textbf{AUPR $\uparrow$}} &
  \multicolumn{1}{l}{\textbf{FPR95 $\downarrow$}} &
  \multicolumn{1}{l}{\textbf{AUROC $\uparrow$}} &
  \multicolumn{1}{l}{\textbf{AUPR $\uparrow$}} \\ \midrule
Mahalanobis &
  3.21 &
  99.17 &
  99.31 &
  0.0 &
  99.94 &
  99.95 &
  0.0 &
  99.98 &
  99.99 \\
KNN &
  24.35 &
  95.62 &
  96.0 &
  0.48 &
  99.62 &
  99.67 &
  0.0 &
  99.87 &
  99.92 \\
LaREM-2048 &
  9.0 &
  98.26 &
  98.34 &
  0.0 &
  99.91 &
  99.91 &
  0.0 &
  99.99 &
  100.0 \\
LaRED-PCA58 &
  32.35 &
  92.48 &
  92.27 &
  0.26 &
  99.14 &
  99.28 &
  0.0 &
  99.51 &
  99.65 \\ \bottomrule
\end{tabular}
% \caption{Cityscapes Deeplabv3+ semantic segmentation distribution shift detection results for Cityscapes-Anomalies, Woodscape, and Woodscape-Soiling datasets.}
\caption{DeepLabv3+ trained w/Cityscapes dataset: distribution shift detection results}
\label{table:cs_dplbv3p_semseg_larem_lared_results}
\end{table*}

\begin{table*}[t!]
\centering
% \footnotesize
\scriptsize
\begin{tabular}{@{}lrrrrrrrrr@{}}
\toprule
\multicolumn{1}{c}{\multirow{2}{*}{\textbf{Methods}}} &
  \multicolumn{3}{c}{\textbf{Woodscape-Anomalies}} &
  \multicolumn{3}{c}{\textbf{Cityscapes}} &
  \multicolumn{3}{c}{\textbf{Woodscape-Soiling}} \\ \cmidrule(l){2-10} 
\multicolumn{1}{c}{} &
  \multicolumn{1}{l}{\textbf{FPR95 $\downarrow$}} &
  \multicolumn{1}{l}{\textbf{AUROC $\uparrow$}} &
  \multicolumn{1}{l|}{\textbf{AUPR $\uparrow$}} &
  \multicolumn{1}{l}{\textbf{FPR95 $\downarrow$}} &
  \multicolumn{1}{l}{\textbf{AUROC $\uparrow$}} &
  \multicolumn{1}{l|}{\textbf{AUPR $\uparrow$}} &
  \multicolumn{1}{l}{\textbf{FPR95 $\downarrow$}} &
  \multicolumn{1}{l}{\textbf{AUROC $\uparrow$}} &
  \multicolumn{1}{l}{\textbf{AUPR $\uparrow$}} \\ \midrule
Mahalanobis &
  0.48 &
  99.6 &
  99.71 &
  0.0 &
  99.67 &
  99.84 &
  4.28 &
  98.8 &
  99.2 \\
KNN &
  4.51 &
  98.69 &
  98.96 &
  0.0 &
  99.79 &
  99.89 &
  10.44 &
  97.93 &
  98.41 \\
LaREM-2048 &
  29.72 &
  92.78 &
  88.5 &
  5.43 &
  96.52 &
  94.88 &
  28.67 &
  81.42 &
  84.14 \\
LaRED-PCA50 &
  20.39 &
  94.32 &
  92.48 &
  4.3 &
  98.54 &
  98.5 &
  13.11 &
  95.87 &
  95.45 \\ \bottomrule
\end{tabular}
% \caption{Woodscape Deeplabv3+ semantic segmentation distribution shift detection results for Woodscape Anomalies, Cityscapes, and Woodscape-Soiling datasets.}
\caption{DeepLabv3+ trained w/Woodscape dataset: distribution shift detection results}
\label{table:ws_dplbv3p_semseg_larem_lared_results}
\end{table*}



\begin{table*}[t!]
\centering
% \footnotesize
\scriptsize
\begin{tabular}{@{}llrrrrrrrrr@{}}
\toprule
\multicolumn{1}{c}{\multirow{2}{*}{\textbf{Method}}} &
  \multicolumn{1}{c}{\multirow{2}{*}{\textbf{\begin{tabular}[c]{@{}c@{}}InD\\ Dataset\end{tabular}}}} &
  \multicolumn{3}{c}{\textbf{InD-Anomalies}} &
  \multicolumn{3}{c}{\textbf{Woodscape / Cityscapes}} &
  \multicolumn{3}{c}{\textbf{Woodscape-Soiling}} \\ \cmidrule(l){3-11} 
\multicolumn{1}{c}{} &
  \multicolumn{1}{c}{} &
  \multicolumn{1}{l}{\textbf{FPR95 $\downarrow$}} &
  \multicolumn{1}{l}{\textbf{AUROC $\uparrow$}} &
  \multicolumn{1}{l|}{\textbf{AUPR $\uparrow$}} &
  \multicolumn{1}{l}{\textbf{FPR95 $\downarrow$}} &
  \multicolumn{1}{l}{\textbf{AUROC $\uparrow$}} &
  \multicolumn{1}{l|}{\textbf{AUPR $\uparrow$}} &
  \multicolumn{1}{l}{\textbf{FPR95 $\downarrow$}} &
  \multicolumn{1}{l}{\textbf{AUROC $\uparrow$}} &
  \multicolumn{1}{l}{\textbf{AUPR $\uparrow$}} \\ \midrule
LaRED-PCA50 & Cityscapes & 31.21 & 90.88 & 89.43 & 14.23 & 9.71  & 97.37 & 7.94  & 97.88 & 98.25 \\
LaRED-PCA50 & Woodscape  & 17.4  & 97.28 & 97.87 & 7.11  & 98.24 & 98.7  & 35.94 & 90.73 & 91.31 \\ \bottomrule
\end{tabular}
\caption{U-Net distribution shift detection results}
\label{table:semseg_unet_lared_results}
\end{table*}

\newpage

\begin{figure*}[pt!]
    \centering
    \begin{subfigure}{0.32\linewidth}
        \includegraphics[width=\linewidth]{Figures/sem_seg/results_dplbv3p_cs/entropy_2d_projection_cs_vs_cs_anomaly.png}
        \caption{Cityscapes vs Cityscapes-Anomalies}
        \label{subfig:sem_seg_dplbv3p_cs_vs_cs_anomal_2d_proj}
    \end{subfigure}
    % \hfil
    \begin{subfigure}{0.32\linewidth}
        \includegraphics[width=\linewidth]{Figures/sem_seg/results_dplbv3p_cs/entropy_2d_projection_cs_vs_ws.png}
        \caption{Cityscapes vs Woodscape}
        \label{subfig:sem_seg_dplbv3p_cs_vs_ws_2d_proj}
    \end{subfigure}
    % \hfil
    \begin{subfigure}{0.32\linewidth}
        \includegraphics[width=\linewidth]{Figures/sem_seg/results_dplbv3p_cs/entropy_2d_projection_cs_vs_ws_soil.png}
        \caption{Cityscapes vs Woodscape-Soiling}
        \label{subfig:sem_seg_dplbv3p_cs_vs_ws_soil_2d_proj}
    \end{subfigure}
\caption{DeepLabv3+ trained w/Cityscapes (InD): Entropy vectors 2D projection comparison using PaCMAP}
\label{fig:sem_seg_dplbv3p_cs_2d_proj}
\end{figure*}


\begin{figure*}[pt!]
    \centering
    \begin{subfigure}{0.32\linewidth}
        \includegraphics[width=\linewidth]{Figures/sem_seg/results_dplbv3p_ws/entropy_2d_projection_ws_vs_ws_anomaly.png}
        \caption{Woodscape vs Woodscape-Anomalies}
        \label{subfig:sem_seg_dplbv3p_ws_vs_ws_anomal_2d_proj}
    \end{subfigure}
    % \hfil
    \begin{subfigure}{0.32\linewidth}
        \includegraphics[width=\linewidth]{Figures/sem_seg/results_dplbv3p_ws/entropy_2d_projection_ws_vs_cs.png}
        \caption{Woodscape vs Cityscapes}
        \label{subfig:sem_seg_dplbv3p_ws_vs_cs_2d_proj}
    \end{subfigure}
    % \hfil
    \begin{subfigure}{0.32\linewidth}
        \includegraphics[width=\linewidth]{Figures/sem_seg/results_dplbv3p_ws/entropy_2d_projection_ws_vs_ws_soil.png}
        \caption{Woodscape vs Woodscape-Soiling}
        \label{subfig:sem_seg_dplbv3p_ws_vs_ws_soil_2d_proj}
    \end{subfigure}
\caption{DeepLabv3+ trained w/Woodscape (InD): Entropy vectors 2D projection comparison using PaCMAP}
\label{fig:sem_seg_dplbv3p_ws_2d_proj}
\end{figure*}



% \newpage
\begin{figure*}[t!]
    \centering
    \begin{subfigure}{0.32\linewidth}
        \includegraphics[width=\linewidth]{Figures/sem_seg/results_dplbv3p_cs/lared_pca58_results_cs_vs_cs_anomal.png}
        \caption{Cityscapes vs Cityscapes-Anomalies}
        \label{subfig:sem_seg_dplbv3p_cs_vs_cs_anomal_lared_score}
    \end{subfigure}
    % \hfil
    \begin{subfigure}{0.32\linewidth}
        \includegraphics[width=\linewidth]{Figures/sem_seg/results_dplbv3p_cs/lared_pca58_results_cs_vs_ws.png}
        \caption{Cityscapes vs Woodscape}
        \label{subfig:sem_seg_dplbv3p_cs_vs_ws_lared_score}
    \end{subfigure}
    % \hfil
    \begin{subfigure}{0.32\linewidth}
        \includegraphics[width=\linewidth]{Figures/sem_seg/results_dplbv3p_cs/lared_pca58_results_cs_vs_ws_soil.png}
        \caption{Cityscapes vs Woodscape-Soiling}
        \label{subfig:sem_seg_dplbv3p_cs_vs_ws_soil_lared_score}
    \end{subfigure}
\caption{DeepLabv3+ trained w/Cityscapes (InD): LaRED score comparison\vspace{0.5em}}
\label{fig:sem_seg_dplbv3p_cs_lared_score_comp}
\end{figure*}

\begin{figure*}[t!]
    \centering
    \begin{subfigure}{0.32\linewidth}
        \includegraphics[width=\linewidth]{Figures/sem_seg/results_dplbv3p_cs/larem_results_cs_vs_cs_anomal.png}
        \caption{Cityscapes vs Cityscapes-Anomalies}
        \label{subfig:sem_seg_dplbv3p_cs_vs_cs_anomal_larem_score}
    \end{subfigure}
    % \hfil
    \begin{subfigure}{0.32\linewidth}
        \includegraphics[width=\linewidth]{Figures/sem_seg/results_dplbv3p_cs/larem_results_cs_vs_ws.png}
        \caption{Cityscapes vs Woodscape}
        \label{subfig:sem_seg_dplbv3p_cs_vs_ws_larem_score}
    \end{subfigure}
    % \hfil
    \begin{subfigure}{0.32\linewidth}
        \includegraphics[width=\linewidth]{Figures/sem_seg/results_dplbv3p_cs/larem_results_cs_vs_ws_soil.png}
        \caption{Cityscapes vs Woodscape-Soiling}
        \label{subfig:sem_seg_dplbv3p_cs_vs_ws_soil_larem_score}
    \end{subfigure}
\caption{DeepLabv3+ trained w/Cityscapes (InD): LaREM score comparison\vspace{0.5em}}
\label{fig:sem_seg_dplbv3p_cs_larem_score_comp}
\end{figure*}


\begin{figure*}[t!]
    \centering
    \begin{subfigure}{0.32\linewidth}
        \includegraphics[width=\linewidth]{Figures/sem_seg/results_dplbv3p_ws/lared_pca50_results_ws_vs_ws_anomal.png}
        \caption{Woodscape vs Woodscape-Anomalies}
        \label{subfig:sem_seg_dplbv3p_ws_vs_ws_anomal_lared_score}
    \end{subfigure}
    % \hfil
    \begin{subfigure}{0.32\linewidth}
        \includegraphics[width=\linewidth]{Figures/sem_seg/results_dplbv3p_ws/lared_pca50_results_ws_vs_cs.png}
        \caption{Woodscape vs Cityscapes}
        \label{subfig:sem_seg_dplbv3p_ws_vs_cs_lared_score}
    \end{subfigure}
    % \hfil
    \begin{subfigure}{0.32\linewidth}
        \includegraphics[width=\linewidth]{Figures/sem_seg/results_dplbv3p_ws/lared_pca50_results_ws_vs_ws_soil.png}
        \caption{Woodscape vs Woodscape-Soiling}
        \label{subfig:sem_seg_dplbv3p_ws_vs_ws_soil_lared_score}
    \end{subfigure}
\caption{DeepLabv3+ trained w/Woodscape (InD): LaRED score comparison \vspace{0.5em}}
\label{fig:sem_seg_dplbv3p_ws_lared_score_comp}
\end{figure*}


\begin{figure*}[t!]
    \centering
    \begin{subfigure}{0.32\linewidth}
        \includegraphics[width=\linewidth]{Figures/sem_seg/results_dplbv3p_ws/larem_results_ws_vs_ws_anomal.png}
        \caption{Woodscape vs Woodscape-Anomalies}
        \label{subfig:sem_seg_dplbv3p_ws_vs_ws_anomal_larem_score}
    \end{subfigure}
    % \hfil
    \begin{subfigure}{0.32\linewidth}
        \includegraphics[width=\linewidth]{Figures/sem_seg/results_dplbv3p_ws/larem_results_ws_vs_cs.png}
        \caption{Woodscape vs Cityscapes}
        \label{subfig:sem_seg_dplbv3p_ws_vs_cs_larem_score}
    \end{subfigure}
    % \hfil
    \begin{subfigure}{0.32\linewidth}
        \includegraphics[width=\linewidth]{Figures/sem_seg/results_dplbv3p_ws/larem_results_ws_vs_ws_soil.png}
        \caption{Woodscape vs Woodscape-Soiling}
        \label{subfig:sem_seg_dplbv3p_ws_vs_ws_soil_larem_score}
    \end{subfigure}
\caption{DeepLabv3+ trained w/Woodscape (InD): LaREM score comparison\vspace{0.5em}}
\label{fig:sem_seg_dplbv3p_ws_larem_score_comp}
\end{figure*}

\vspace{1em}
\newpage
\begin{figure*}[pt!]
    \centering
    \begin{subfigure}{0.32\linewidth}
        \includegraphics[width=\linewidth]{Figures/sem_seg/results_unet/unet_cs_lared_entropy_2d_projection_cs_vs_cs_anomal.png}
        \caption{Cityscapes vs Cityscapes-Anomalies}
        \label{subfig:sem_seg_unet_cs_vs_cs_anomal_2d_proj}
    \end{subfigure}
    % \hfil
    \begin{subfigure}{0.32\linewidth}
        \includegraphics[width=\linewidth]{Figures/sem_seg/results_unet/unet_cs_lared_entropy_2d_projection_cs_vs_ws.png}
        \caption{Cityscapes vs Woodscape}
        \label{subfig:sem_seg_unet_cs_vs_ws_2d_proj}
    \end{subfigure}
    % \hfil
    \begin{subfigure}{0.32\linewidth}
        \includegraphics[width=\linewidth]{Figures/sem_seg/results_unet/unet_cs_lared_entropy_2d_projection_cs_vs_ws_soil.png}
        \caption{Cityscapes vs Woodscape-Soiling}
        \label{subfig:sem_seg_unet_cs_vs_ws_soil_2d_proj}
    \end{subfigure}
\caption{U-Net trained w/Cityscapes (InD): Entropy vectors 2D projection comparison using PaCMAP}
\label{fig:sem_seg_unet_cs_2d_proj}
\end{figure*}
\begin{figure*}[pt!]
    \centering
    \begin{subfigure}{0.32\linewidth}
        \includegraphics[width=\linewidth]{Figures/sem_seg/results_unet/unet_ws_lared_entropy_2d_projection_ws_vs_ws_anomal.png}
        \caption{Woodscape vs Woodscape-Anomalies}
        \label{subfig:sem_seg_unet_ws_vs_ws_anomal_2d_proj}
    \end{subfigure}
    % \hfil
    \begin{subfigure}{0.32\linewidth}
        \includegraphics[width=\linewidth]{Figures/sem_seg/results_unet/unet_ws_lared_entropy_2d_projection_ws_vs_cs.png}
        \caption{Woodscape vs Cityscapes}
        \label{subfig:sem_seg_unet_ws_vs_ws_2d_proj}
    \end{subfigure}
    % \hfil
    \begin{subfigure}{0.32\linewidth}
        \includegraphics[width=\linewidth]{Figures/sem_seg/results_unet/unet_ws_lared_entropy_2d_projection_ws_vs_ws_soil.png}
        \caption{Woodscape vs Woodscape-Soiling}
        \label{subfig:sem_seg_unet_ws_vs_ws_soil_2d_proj}
    \end{subfigure}
\caption{U-Net trained w/Woodscape (InD): Entropy vectors 2D projection comparison using PaCMAP}
\label{fig:sem_seg_unet_ws_2d_proj}
\end{figure*}
\begin{figure*}[pt!]
    \centering
    \begin{subfigure}{0.32\linewidth}
        \includegraphics[width=\linewidth]{Figures/sem_seg/results_unet/unet_cs_lared_dist_cs_vs_cs_anomal.png}
        \caption{Cityscapes vs Cityscapes-Anomalies}
        \label{subfig:sem_seg_unet_cs_vs_cs_anomal_lared_score}
    \end{subfigure}
    % \hfil
    \begin{subfigure}{0.32\linewidth}
        \includegraphics[width=\linewidth]{Figures/sem_seg/results_unet/unet_cs_lared_dist_cs_vs_ws.png}
        \caption{Cityscapes vs Woodscape}
        \label{subfig:sem_seg_unet_cs_vs_ws_lared_score}
    \end{subfigure}
    % \hfil
    \begin{subfigure}{0.32\linewidth}
        \includegraphics[width=\linewidth]{Figures/sem_seg/results_unet/unet_cs_lared_dist_cs_vs_ws_soil.png}
        \caption{Cityscapes vs Woodscape-Soiling}
        \label{subfig:sem_seg_unet_cs_vs_ws_soil_lared_score}
    \end{subfigure}
\caption{U-Net trained w/Cityscapes (InD): LaRED score comparison}
\label{fig:sem_seg_unet_cs_lared_score_comp}
\end{figure*}
\begin{figure*}[pt!]
    \centering
    \begin{subfigure}{0.32\linewidth}
        \includegraphics[width=\linewidth]{Figures/sem_seg/results_unet/unet_ws_lared_dist_ws_vs_ws_anomal.png}
        \caption{Woodscape vs Woodscape-Anomalies}
        \label{subfig:sem_seg_unet_ws_vs_ws_anomal_lared_score}
    \end{subfigure}
    % \hfil
    \begin{subfigure}{0.32\linewidth}
        \includegraphics[width=\linewidth]{Figures/sem_seg/results_unet/unet_ws_lared_dist_ws_vs_cs.png}
        \caption{Woodscape vs Cityscapes}
        \label{subfig:sem_seg_unet_ws_vs_cs_lared_score}
    \end{subfigure}
    % \hfil
    \begin{subfigure}{0.32\linewidth}
        \includegraphics[width=\linewidth]{Figures/sem_seg/results_unet/unet_ws_lared_dist_ws_vs_ws_soil.png}
        \caption{Woodscape vs Woodscape-Soiling}
        \label{subfig:sem_seg_unet_ws_vs_ws_soil_lared_score}
    \end{subfigure}
\caption{U-Net trained w/Woodscape (InD): LaRED score comparison}
\label{fig:sem_seg_unet_ws_lared_score_comp}
\end{figure*}





% \newpage
\clearpage
\pagebreak
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%   					   SECTION							%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Image-Level Detection vs. More Detailed Detection Schemes}
\label{append:image_lvl_vs_detailed_lvl_detect}


We performed image-level OoD detection, where the whole input image is classified as InD or OoD, even for the object detection and semantic segmentation tasks. In this regard, our results raise the question of whether more detailed or localized detection schemes alone are sufficient for detecting distribution shifts in complex computer vision tasks. In object detection, our results from \Cref{table:obj_detect_bdd100k_LaRED} show that simple post-hoc methods adapted for image-level detection can surpass recent \textit{SotA} object-level detection methods \citep{du2022vos,wilson2023safe}.
In semantic segmentation, recent benchmarks \citep{chan2021segmentmeifyoucan} also consider adapted post-hoc methods for anomaly detection at the pixel level. Nevertheless, the execution runtime of these methods is prohibitive for safety-critical applications with tight time constraints.
Therefore, we believe that image-level detection can be seen as a preliminary or complementary step towards object-level or pixel-level OoD detection, which are more difficult problems.


Regarding the evaluation, for the object detection task, the OoD datasets (COCO and OpenImages) are semantically and visually distant from the InD dataset BDD100k. Objects, backgrounds, and scenes are all quite different, which creates an ideal situation for our proposed method and uncertainty-based confidence scores. For semantic segmentation, the evaluation was limited to covariate-shift data close to the InD datasets. In this case, the evaluation can be extended using datasets covered in anomaly segmentation benchmarks~\citep{chan2021segmentmeifyoucan} and datasets with stronger semantic shifts.

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%   					SUB-SECTION							%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\subsection{On Semantic Segmentation Predictive Uncertainty with MCD}
\label{append:sem_seg_sem_seg_pred_entropy_mcd}

It is important to acknowledge the limitations of predictive entropy with MCD. Predictive uncertainty in semantic segmentation provides a confidence measure per pixel instead of an image-level confidence measure. A qualitative inspection of the predictive entropy maps in \Cref{fig:sem_seg_dplbv3p_mcd_robustness} shows no noticeable difference between the predictions for semantically similar samples. Concretely, a DeepLabv3+ DNN trained with the Woodscape dataset is sufficiently robust to handle input samples from the Cityscapes dataset. Although robustness is a desired property in DNNs, we cannot assume that the validation or test set performance will hold for new \say{shifted} samples.
Moreover, image perturbations due to environmental factors can lead to wrong, overconfident predictions, as illustrated in~\Cref{fig:dplbv3_mcd_preds_anomaly}. From a strict safety point of view, it is impossible to provide performance guarantees given the high-dimensional input space and our ignorance of all the potential factors that can cause or lead to a data distribution shift. Safety is about rare, high-consequence events such as those depicted in~\Cref{fig:dplbv3_mcd_preds_anomaly}. Therefore, the detection of both mild and drastic distribution shifts is paramount for safe deployment and to elicit trust in the DNN-based component, as shown with both of our proposed confidence scores, \texttt{LaRED} and \texttt{LaREM}, in \Cref{append:sem_seg_exp_detail}.
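
For reference, the per-pixel predictive entropy discussed above is commonly computed as follows. This is a sketch in notation introduced only for this remark: the symbols $T$, $C$, $u$, and $\omega_{t}$ are assumptions and may differ from the main text. Given $T$ stochastic forward passes with sampled dropout masks $\omega_{1}, \dots, \omega_{T}$ and $C$ semantic classes, the mean predictive distribution and its entropy at pixel $u$ of an input image $\mathbf{x}$ are
\begin{align*}
\bar{p}_{c}(u) &= \frac{1}{T} \sum_{t=1}^{T} p\left(y_{u} = c \mid \mathbf{x}, \omega_{t}\right), \\
\mathcal{H}(u) &= - \sum_{c=1}^{C} \bar{p}_{c}(u) \log \bar{p}_{c}(u) .
\end{align*}
The result is one entropy value per pixel, i.e., a map with the spatial size of the prediction, rather than a single image-level score.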


\newpage

\begin{figure*}[pt!]
    \centering
    \begin{subfigure}{0.32\linewidth}
        \includegraphics[width=\linewidth]{Figures/sem_seg/mcd_pred/ws_dplbv3p_InD_preds.png}
        \caption{Woodscape input sample (InD)}
        \label{subfig:sem_seg_mcd_preds_ind_ws}
    \end{subfigure}
    \quad
    \begin{subfigure}{0.32\linewidth}
        \includegraphics[width=\linewidth]{Figures/sem_seg/mcd_pred/ws_dplbv3p_shift_cs_preds.png}
        \caption{Cityscapes input sample (shift)}
        \label{subfig:sem_seg_mcd_preds_shift_cs}
    \end{subfigure}
\caption{DeepLabv3+ MCD predictions for an InD sample vs a
Cityscapes (shift) sample, illustrating the DNN robustness}
\label{fig:sem_seg_dplbv3p_mcd_robustness}
\end{figure*}
\begin{figure*}[pt!]
    \centering
    \includegraphics[width=0.82\linewidth]{Figures/sem_seg/mcd_pred/semseg_overconfident_wrong_pred.png}
    \caption{DeepLabv3+ MCD predictions and predictive entropy: qualitative comparison for an InD sample w/covariate shift. The top row shows the input image with the mud-on-lens perturbation and the ground truth labels. The bottom row shows the semantic and entropy maps predicted by the DNN with MCD. The yellow circle highlights the wrong, overconfident predictions when a relevant actor in the environment is partially occluded by the mud perturbation, exhibiting the DNN's lack of understanding of semantic structures and contexts}
    \label{fig:dplbv3_mcd_preds_anomaly}
\end{figure*}




\end{document}
