\section{Related Work}
\label{sec:related}
\subsection{Stochastic uncertainty quantification}
Uncertainty quantification evaluates how confident a model is in its predictions by accounting for both the limitations of the model itself and the variability present in the data~\citep{intro}.
A common Bayesian-inspired method for estimating uncertainty is Monte Carlo (MC) Dropout.
MC Dropout keeps Dropout active at inference time and samples multiple stochastic forward passes through the network in order to approximate the Bayesian posterior predictive distribution:
\begin{equation}
\label{eq:predictive_bayes}
    \mathbb{P}(y | \mathbf{x}, \mathcal{D}_{\gamma}) = \int \mathbb{P}(y|\mathbf{x}, \theta)\,\mathbb{P}(\theta|\mathcal{D}_{\gamma})\,d\theta,
\end{equation}
where $\mathcal{D}_{\gamma}$ denotes the training dataset, $\mathbf{x}$ the input, $y$ the label and $\theta$ the network parameters.
Formally, for classification with $K$ classes and an input $\mathbf{x}$, $M$ stochastic forward passes at inference time yield a set of outputs $\{P(y|\mathbf{x},\theta_1), \dots, P(y|\mathbf{x},\theta_M) \}$, where $\theta_i$ denotes the parameters retained under the $i$-th sampled Dropout mask, and Eq.~\eqref{eq:predictive_bayes} is approximated by the following empirical mean:
\begin{equation}
    \forall y\in \llbracket 1,K\rrbracket, \frac{1}{M} \sum_{i=1}^MP(y|\mathbf{x}, \theta_i) \simeq \mathbb{P}(y | \mathbf{x}, \mathcal{D}_{\gamma}).
\label{eq:approx_bayes}
\end{equation}
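As a purely illustrative example with hypothetical numbers, take $K = 2$ and $M = 3$, and suppose the three stochastic passes return the class-1 probabilities $0.9$, $0.7$ and $0.8$. The empirical mean of Eq.~\eqref{eq:approx_bayes} then gives
\begin{equation*}
    \frac{1}{3}\left(0.9 + 0.7 + 0.8\right) = 0.8
    \quad \text{for } y = 1,
    \qquad
    \frac{1}{3}\left(0.1 + 0.3 + 0.2\right) = 0.2
    \quad \text{for } y = 2,
\end{equation*}
so the averaged prediction remains a valid probability vector, while the spread of the individual passes around this mean reflects the disagreement between the sampled sub-networks.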

Despite its success, Dropout may capture only limited diversity in predictions because it acts solely by deactivating neuron outputs. To address this, DropConnect~\citep{wan2013regularization} provides a more effective mechanism for inducing diversity: it injects fine-grained noise directly into the weight matrices, producing more nuanced perturbations in the network's embeddings.

Let $U \in \mathcal{M}_{K \times D}(\mathbb{R})$ be the weight matrix of a fully connected layer, so that for an input $\mathbf{x} \in \mathbb{R}^D$ the output is $\mathbf{z} = U \mathbf{x}$.
DropConnect introduces stochasticity directly into the weight matrix as follows:
\begin{enumerate}[wide, labelwidth=!, labelindent=5pt]
    \item Generate a binary mask matrix $B \in \{0,1\}^{K \times D}$,
    where each entry is sampled independently as
    $B_{ij} \sim \operatorname{Bernoulli}(p)$ with $p = 1-q$, $q$ being the probability that $B_{ij} = 0$, i.e., that the weight $U_{ij}$ is dropped.
    \item Compute the effective weight matrix via the Hadamard product:
    $\widetilde{U} = U \odot B$. For an input $\mathbf{x} \in \mathbb{R}^D$, the layer’s output becomes
    $\tilde{\mathbf{z}} = \widetilde{U}\mathbf{x} = (U \odot B)\mathbf{x}$.
    
\end{enumerate}
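The two steps above can be illustrated on a small example; the numerical values below are hypothetical and chosen only for readability. With $K = D = 2$,
\begin{equation*}
    U = \begin{pmatrix} 1 & 2 \\ 3 & 4 \end{pmatrix},
    \qquad
    B = \begin{pmatrix} 1 & 0 \\ 1 & 1 \end{pmatrix},
    \qquad
    \mathbf{x} = \begin{pmatrix} 1 \\ 1 \end{pmatrix},
\end{equation*}
we obtain
\begin{equation*}
    \widetilde{U} = U \odot B = \begin{pmatrix} 1 & 0 \\ 3 & 4 \end{pmatrix},
    \qquad
    \tilde{\mathbf{z}} = \widetilde{U}\mathbf{x} = \begin{pmatrix} 1 \\ 7 \end{pmatrix},
    \qquad \text{whereas} \qquad
    \mathbf{z} = U\mathbf{x} = \begin{pmatrix} 3 \\ 7 \end{pmatrix}.
\end{equation*}
Only the first output coordinate is perturbed, since a single weight was masked. Moreover, because $\mathbb{E}[B_{ij}] = p$ for every entry, linearity of expectation gives $\mathbb{E}[\tilde{\mathbf{z}}] = p\, U\mathbf{x}$, so the masked output is, on average, a rescaled version of the deterministic one.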

%\subsubsection{Disentangle Epistemic and Aleatoric Uncertainty}

%\textcolor{brown}{The Maximum Softmax Probability(MSP)~~\citep{msp} method has often been criticized over the years for its inability to capture epistemic uncertainty or, more cautiously, for its failure to distinguish between epistemic and aleatoric uncertainty. Since OoD data relates to epistemic uncertainty, ~\citep{priornetwork} emphasize that MSP’s reliance on high-entropy posteriors is problematic. Given the posterior, $\mathbb{P(\theta| \mathcal{D})}$, the aleatoric uncertainty is defined as:
%\begin{equation} \mathcal{H}_{\text{aleatoric}}(\mathbf{x},y) :=
 % \mathbb{E}_{\theta \sim \mathbb{P}(\theta \mid \mathcal{D})} \Big[ -\mathbb{E}_{\mathbb{P}(y \mid \mathbf{x}, \theta)} \big[ \log_2 \mathbb{P}(y|\mathbf{x}, \theta) \big] \Big]. 
%\end{equation} MSP struggles to distinguish ID ambiguities from genuine OoD data because both can produce similar high-entropy score vectors. In terms of Shannon information theory, these vectors often present a flat or nearly uniform distribution of probabilities ~\citep{ovadia}.}
\subsection{Deterministic uncertainty quantification}

\subsubsection{Local density, distance and curse of dimensionality}

%The Bayesian approach to epistemic uncertainty often struggles with detecting OoD inputs, leading to the development of deterministic uncertainty quantification methods. 
To detect OoD samples, a simple idea is to measure the uncertainty of the samples and to classify as OoD those whose uncertainty exceeds a certain threshold.
For instance, prior deterministic methods such as those proposed in~\citet{duq, DDU} define uncertainty in terms of the distance or local density of an input relative to training samples in the embedding space. An input 
$\mathbf{x}$ is assumed to belong to the training distribution if its embedding $\mathbf{z} = h_\theta(\mathbf{x})$ lies near a class-specific centroid $\boldsymbol{\mu}_c$. Thus, the uncertainty score is defined as:
\begin{equation}
    \text{Uncertainty}(\mathbf{x}) \propto \min_c \|\mathbf{z} - \boldsymbol{\mu}_c \|^2.
\end{equation}
An input is classified as OoD if its distance to the nearest class centroid exceeds a threshold. This approach links large distances in the embedding space to higher uncertainty. However, it can be limited by the curse of dimensionality~\citep{vershynin2018high}. %\textcolor{brown}{Additionally, these methods do not separate aleatoric from epistemic uncertainty (if needed).}

In high-dimensional embedding spaces, the discriminative power of distances to class centroids or of local density can diminish, so simple thresholds become less effective at separating ID and OoD samples. This is a well-documented manifestation of the curse of dimensionality: as the feature dimension increases, distance and density metrics can lose their meaningfulness, even when the neural network produces well-separated clusters in the embedding space~\citep{olteanu2023meta}. Recent works have started to take this into account explicitly. For instance, SIREN~\citep{du2022siren} projects the embeddings into a lower-dimensional space and then normalizes them onto the hypersphere to fit a von Mises–Fisher distribution. \citet{nguyen2024combining} likewise use a projection to reduce dimensionality when describing the embedding’s geometry.
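
The loss of contrast in distances can be made concrete under a simplifying assumption that is ours and serves only as an illustration, not as a model of actual network embeddings: let $\mathbf{z} \sim \mathcal{N}(\mathbf{0}, I_d)$ be a standard Gaussian vector in a hypothetical embedding dimension $d$. Then $\|\mathbf{z}\|^2$ follows a chi-squared distribution with $d$ degrees of freedom, so that
\begin{equation*}
    \mathbb{E}\big[\|\mathbf{z}\|^2\big] = d,
    \qquad
    \operatorname{Var}\big(\|\mathbf{z}\|^2\big) = 2d,
    \qquad
    \frac{\sqrt{\operatorname{Var}\big(\|\mathbf{z}\|^2\big)}}{\mathbb{E}\big[\|\mathbf{z}\|^2\big]} = \sqrt{\frac{2}{d}}.
\end{equation*}
For $d = 512$, the relative spread is $\sqrt{2/512} = 1/16 \approx 0.06$: squared distances to the center concentrate in a thin shell around their mean, which leaves little margin for a single threshold to separate two populations.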

\subsubsection{Analytical methods}

Recent analytical OoD detection approaches instead intervene directly in the network’s internal representations or constrain activation patterns to handle OoD inputs more reliably. \citet{sun2021react, azizmalayeri2024mitigating} modify the embedding activations by clipping them above a high percentile threshold computed from ID statistics, which directly suppresses the excessive signals often produced by OoD inputs. Similarly, \citet{djurisic2022extremely} truncate activations beyond a certain percentile and proportionally scale the rest to diminish the impact of hypersensitive neurons.
Alternatively, \citet{L2norm, wei2022mitigating} scale embeddings and pre-softmax logits, respectively, during training. More precisely, \citet{L2norm} scale embedding vectors so that their norms more faithfully reflect each input’s difficulty. The LogitNorm method~\citep{wei2022mitigating}, by contrast, rescales pre-softmax logits, observing that even when most training examples are already classified correctly, the softmax cross-entropy loss keeps increasing the logit norm, which leads to overconfidence.
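As a schematic summary in our own notation (the symbols $h_j$, $c$, $\mathbf{f}$ and $\tau$ are introduced here for illustration only, and the cited works should be consulted for the exact formulations), percentile-based clipping replaces each embedding activation $h_j$ by
\begin{equation*}
    \bar{h}_j = \min(h_j, c),
\end{equation*}
where $c$ is a high percentile of the activations observed on ID data. Logit normalization, in turn, can be sketched as applying the cross-entropy loss to logits $\mathbf{f} \in \mathbb{R}^K$ rescaled to a constant norm:
\begin{equation*}
    \mathcal{L}(\mathbf{f}, y) = -\log \frac{\exp\big(f_y / (\tau \|\mathbf{f}\|)\big)}{\sum_{k=1}^{K} \exp\big(f_k / (\tau \|\mathbf{f}\|)\big)},
\end{equation*}
with $\tau > 0$ a temperature parameter. Since the loss then depends only on the direction of $\mathbf{f}$, increasing the logit norm no longer reduces it.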

\subsection{Geometry of the embedding}
Several studies, including~\citet{softmax, calib}, have examined the geometric and analytical properties of the embedding space induced by the cross-entropy (CE) loss to improve OoD detection. The CE loss promotes class separation by creating well-defined geometric structures within the embedding space, where samples from the same class are tightly clustered and different classes are well separated. This phenomenon, known as \textbf{Neural Collapse} (NC), described by~\citet{papaye} and illustrated in Fig.~\ref{fig:neural_collapse}, occurs in the final stages of training. NC describes the convergence of class embeddings to well-separated class means, or centroids, while the within-class variance decreases. Specifically, the embeddings of samples within the same class collapse to their respective class means, and the class means themselves arrange symmetrically, at equal distances from one another, in a configuration that maximizes inter-class separation. Additionally, the class vectors align with the embeddings, so that each representation points toward its corresponding class prototype.
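
The symmetric arrangement of the class means is commonly formalized as a simplex equiangular tight frame. As a sketch in our notation (the global mean $\boldsymbol{\mu}_G$ and the centered means $\tilde{\boldsymbol{\mu}}_c$ are symbols introduced here for illustration), let $\tilde{\boldsymbol{\mu}}_c = \boldsymbol{\mu}_c - \boldsymbol{\mu}_G$ for each of the $K$ classes. In the idealized collapsed configuration, the centered means have equal norms and satisfy
\begin{equation*}
    \frac{\langle \tilde{\boldsymbol{\mu}}_c, \tilde{\boldsymbol{\mu}}_{c'} \rangle}{\|\tilde{\boldsymbol{\mu}}_c\| \, \|\tilde{\boldsymbol{\mu}}_{c'}\|} = -\frac{1}{K-1}
    \qquad \text{for all } c \neq c'.
\end{equation*}
For $K = 3$, for example, the pairwise cosine equals $-1/2$, i.e., the three centered means lie at $120^\circ$ from one another, which is the largest common angle that three vectors can share.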