%\documentclass{uai2025} % Final submission to show authors 
\documentclass[accepted]{uai2025} % after acceptance, for a revised version; 
% also before submission to see how the non-anonymous paper would look like 
                        
%% There is a class option to choose the math font
% \documentclass[mathfont=ptmx]{uai2025} % ptmx math instead of Computer
                                         % Modern (has noticeable issues)
% \documentclass[mathfont=newtx]{uai2025} % newtx fonts (improves upon
                                          % ptmx; less tested, no support)
% NOTE: Only keep *one* line above as appropriate, as it will be replaced
%       automatically for papers to be published. Do not make any other
%       change above this note for an accepted version.

%% Choose your variant of English; be consistent
\usepackage[american]{babel}
\usepackage{soul}

% \usepackage[british]{babel}

%% Some suggested packages, as needed:
\usepackage{natbib} % has a nice set of citation styles and commands
\usepackage[normalem]{ulem}
    \bibliographystyle{plainnat}
    \renewcommand{\bibsection}{\subsubsection*{References}}
\usepackage{mathtools} % amsmath with fixes and additions
% \usepackage{siunitx} % for proper typesetting of numbers and units
\usepackage{booktabs} % commands to create good-looking tables
\usepackage{tikz} % nice language for creating drawings and diagrams

%% Provided macros
% \smaller: Because the class footnote size is essentially LaTeX's \small,
%           redefining \footnotesize, we provide the original \footnotesize
%           using this macro.
%           (Use only sparingly, e.g., in drawings, as it is quite small.)



\usepackage{amsmath}
\usepackage{amssymb}
\usepackage{float}
\usepackage{algorithm}\usepackage{algpseudocode}
\usepackage{authblk}
\usepackage{ulem}
\usepackage{amsthm}
\usepackage{todonotes}
\usepackage{svg}
\usepackage{stmaryrd}
\usepackage{subcaption}
\usepackage{nth}




\title{Stochastic Embeddings: A Probabilistic and Geometric Analysis of Out-of-Distribution Behavior}

\author[1,2]{Anthony Nguyen}
\author[2]{Emanuel Aldea}
\author[2]{Sylvie Le Hégarat-Mascle}
\author[1]{Renaud Lustrat}

%% affiliations
\affil[1]{Thales Land and Air Systems, BU ARC, Limours, France}
\affil[2]{SATIE Laboratory UMR 8029, Paris-Saclay University, CNRS, Gif-sur-Yvette, France}


% PUT AUTHOR NAMES
\newtheorem{theorem}{Theorem}[section]
\newtheorem{definition}{Definition}[section]
\newtheorem{corollary}{Corollary}[theorem]
\newtheorem{lemma}[theorem]{Lemma}

\newcommand{\remEmi}[1]{\textcolor{red}{\sout{#1}}}
\newcommand{\addEmi}[1]{\textcolor{red}{#1}}
\newcommand{\ComEmi}[1]{\textcolor{red}{\textit{#1}}}

\newcommand{\remSyl}[1]{\textcolor{cyan}{\sout{#1}}}
\newcommand{\addSyl}[1]{\textcolor{blue}{#1}}
\newcommand{\ComSyl}[1]{\textcolor{blue}{\textit{#1}}}

%% place below for defining operators
\newcommand{\etal}{\textit{et al.}}
\newcommand{\Var}{\operatorname{Var}}
\newcommand{\Tr}{\operatorname{Tr}}



\newcommand{\spike}[2]% #1 = size of spike, #2 = centered text
{\bgroup
  \sbox0{#2}%
  \rlap{\usebox0}%
  \hspace{0.5\wd0}%
  \makebox[0pt][c]{\rule[\dimexpr \ht0+1pt]{0.5pt}{#1}}% top spike
  \makebox[0pt][c]{\rule[\dimexpr -\dp0-#1-1pt]{0.5pt}{#1}}% bottom spike
  \hspace{0.5\wd0}%
\egroup}
%%
\begin{document}
\maketitle
\begin{abstract}
Deep neural networks perform well in many applications but often fail when exposed to out-of-distribution (OoD) inputs. We identify a geometric phenomenon in the embedding space: in-distribution (ID) data show higher variance than OoD data under stochastic perturbations. Using high-dimensional geometry and statistics, we explain this behavior and demonstrate its application in improving OoD detection. Unlike traditional post-hoc methods, our approach integrates uncertainty-aware tools, such as Bayesian approximations, directly into the detection process. Then, we show how considering the unit hypersphere enhances the separation of ID and OoD samples. Our mathematically sound method achieves competitive performance while remaining simple.
\end{abstract}

\section{Introduction}\label{sec:intro}

Machine learning models are widely used in fields such as healthcare, autonomous systems, and natural language processing. Deploying them in real-world applications poses challenges often overlooked during development. For instance, a key challenge is detecting when models are uncertain or encounter unfamiliar inputs. Despite their strong performance on clean datasets, deep neural networks often overestimate confidence on unknown or degraded inputs~\citep{guo2017calibration}. This raises reliability concerns, particularly in sensitive applications where models encounter unexpected data or distributions not seen during training.

Uncertainty quantification has become crucial for evaluating model predictions. Early methods, inspired by Bayesian statistics~\citep{robert2005choix}, led to approaches like Deep Ensembles~\citep{lakshminarayanan2017simple} and Bayesian Neural Network approximations. These methods were later adapted for OoD detection~\citep{priornetwork, natPN}. OoD detection focuses on identifying inputs that do not fit the statistical features of the training data. Such inputs may correspond to novel or anomalous situations where the model’s predictions could be unreliable. Simple methods based on Softmax confidence scores~\citep{msp} have shown limitations, as Deep Neural Networks (DNNs) often give overconfident predictions, even for artificial OoD inputs~\citep{hein2019relu}.


Recently, several deterministic methods have been proposed to quantify uncertainty~\citep{duq, DDU, nguyen2024combining}, often using distances or local density in the embedding space. These methods focus on its geometry, which becomes complex in high-dimensional settings~\citep{nalisnick2019detecting}.
Recent theoretical advances have examined and exploited the geometry~\citep{papaye, softmax, neco} and analytical properties~\citep{calib, L2norm} induced by the Cross-Entropy (CE) loss function to enhance OoD detection. These insights show how some structural properties of the basic CE loss can be used to enhance the separation between ID and OoD inputs. Concurrently, competitive post-hoc methods on pre-trained networks~\citep{djurisic2022extremely, sun2022dice, sun2021react} have combined simplicity with strong OoD detection performance. Some of these methods exploit geometric properties of the embedding~\citep{maha, deepknn} and are often more efficient than methods requiring additional training~\citep{zhang2023openood}. Despite these advances in probabilistic and deterministic methods, detecting OoD samples accurately while ensuring interpretability and robustness remains challenging~\citep{deciphering, jaeger2022call}. Our contributions are as follows:
\begin{enumerate}[wide, labelwidth=!, labelindent=5pt]
\item \textbf{Exploration of the variance behavior when injecting stochasticity into the embedding space}: we investigate a counter-intuitive observation arising from the application of Monte Carlo (MC) Dropout within the embedding space, instead of the more commonly studied logit space. One would expect OoD samples to exhibit greater variance across multiple stochastic forward passes, reflecting higher uncertainty compared to ID data during inference. However, our empirical results show the opposite: ID samples consistently exhibit higher variance than OoD samples under MC Dropout.
\item \textbf{Mathematical explanation}: Using high-dimensional probability theory and differential geometry, we explain this variance behavior through the geometric properties of the hypersphere and isotropic random vectors. We show how this insight improves OoD detection.
\item \textbf{A simple and effective algorithm}:
We present an algorithm that delivers excellent performance on standard OoD benchmarks. It is easy to implement, robust in high dimensions, and supported by solid mathematical foundations.
\end{enumerate}
\input{related.tex}

\section{Preliminaries}

This section introduces the notation and background used throughout the paper. We define key symbols and provide the mathematical framework underlying our study.

\subsection{Hypotheses and background }

 Let the training set and the testing set be denoted as $\mathcal{D}_{\text{Train}}= \left\{(\mathbf{x}_i, y_i), i\in \llbracket 1,N_{\text{Train}}\rrbracket\right\}$ and $\mathcal{D}_{\text{Test}}= \left\{(\mathbf{x}_i, y_i), i\in \llbracket 1,N_{\text{Test}}\rrbracket\right\}$ respectively. Here $\mathbf{x}_i \in \mathbb{R}^p$ represents an image and $y_i \in \llbracket 1,K\rrbracket$ its associated label, where $K$ stands for the total number of classes. We assume that the samples of each dataset are independent and identically distributed (i.i.d.) according to their respective joint distributions $\mathbb{P}_{\text{Train}}:= \mathbb{P}_\text{Train}(\mathbf{x},y)$ and $\mathbb{P}_{\text{Test}}:= \mathbb{P}_\text{Test}(\mathbf{x},y)$.


\textbf{Out-of-Distribution (OoD)}: We assume that the training and test sets follow a common distribution denoted by $\mathbb{P}_{\text{ID}}$ (ID data).  We introduce another test set of OoD samples 
$\left\{(\mathbf{x}_i, \upsilon_i), i \in \llbracket 1,N_{\text{OoD}}\rrbracket\right\}$ which are drawn i.i.d. from an unknown distribution denoted by $\mathbb{P}_{\text{OoD}}$, distinct from $\mathbb{P}_{\text{ID}}$. 

In the context of image classification, the embedding of an input image $\mathbf{x}$ is defined by $\mathbf{z} = h_\theta( \mathbf{x}) \in \mathbb{R}^D$, where $D$ denotes the dimension of the embedding space. Inputs, embedding-related vectors, and class vectors are written in bold.
%
For a vector $\mathbf{v} \in \mathbb{R}^D$, $\| \mathbf{v}\| = \sqrt{\sum_{i=1}^D v_i^2}$ denotes its $L^2$ norm, and $S^{D-1} := \{ \mathbf{v} \in \mathbb{R}^D \ | \  \| \mathbf{v}\| = 1\}$ denotes the unit hypersphere in $\mathbb{R}^D$.

The model is divided into two components: the feature extractor denoted by $h_\theta$ and a final linear layer $g_\theta$ acting as a classifier. Thus, the DNN’s output can be written as $f_\theta(\mathbf{x}) = g_\theta \circ h_\theta (\mathbf{x})$. Since $g_\theta: \mathbb{R}^D \to \mathbb{R}^K$ is a linear operator, it can be expressed as a weight matrix $W_\theta \in \mathcal{M}_{K,D}(\mathbb{R})$.


To introduce stochasticity during the inference phase, consider a fixed input $\mathbf{x}$.  Let $\left\{\mathbf{z}^{(m)}, m \in\llbracket 1,M\rrbracket\right\}$ represent the collection of embeddings obtained from $M$ stochastic forward passes through the network. Each embedding is defined as $\mathbf{z}^{(m)}:= h_\theta(\mathbf{x}; \sigma^{(m)}) \in \mathbb{R}^D$, where $\sigma^{(m)}$ denotes the stochastic perturbation applied during the $m$-th forward pass.  


If $\mathbf{x}_{\text{ID}} \sim \mathbb{P}_{\text{ID}}(\mathbf{x})$ (resp. $\mathbf{x}_{\text{OoD}} \sim \mathbb{P}_{\text{OoD}}(\mathbf{x})$), we denote by $Z_{\text{ID}} \in \mathcal{M}_{D, M}(\mathbb{R})$ (resp. $Z_{\text{OoD}}$) the matrix whose columns are the vectors $\mathbf{z}^{(1)}, \ldots, \mathbf{z}^{(M)}$, omitting the index $M$ to simplify the notation.


If DropConnect is applied to produce $M$ vectors, the matrix with these vectors as its columns is denoted by $Z_{DC}$ or $A_{DC}$ depending on the context. 

To quantify the dispersion of these families of vectors, for any matrix \( Z \in \mathcal{M}_{D, M}(\mathbb{R}) \) with columns $\mathbf{z}^{(1)}, \ldots, \mathbf{z}^{(M)}$, we define the unbiased estimator:
    \begin{equation}
            \Var(Z):=\Tr\left( \frac{1}{M-1} \sum_{i=1}^M (\mathbf{z}^{(i)} - \boldsymbol{\mu})(\mathbf{z}^{(i)} - \boldsymbol{\mu})^T \right),
    \end{equation}
    where
    $\Tr$ denotes the trace of a matrix, i.e., the sum of its diagonal entries, and
    \begin{equation}
            \boldsymbol{\mu} = \frac{1}{M} \sum_{i=1}^M \mathbf{z}^{(i)}.
    \end{equation}
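
    As an illustration of how this estimator can be evaluated, note that $\Tr(\mathbf{v}\mathbf{v}^T) = \|\mathbf{v}\|^2$ for any $\mathbf{v} \in \mathbb{R}^D$, so that, by linearity of the trace,
    \begin{align*}
        \Var(Z) &= \frac{1}{M-1} \sum_{i=1}^M \Tr\Bigl( (\mathbf{z}^{(i)} - \boldsymbol{\mu})(\mathbf{z}^{(i)} - \boldsymbol{\mu})^T \Bigr) \\
                &= \frac{1}{M-1} \sum_{i=1}^M \bigl\| \mathbf{z}^{(i)} - \boldsymbol{\mu} \bigr\|^2 .
    \end{align*}
    In other words, $\Var(Z)$ is the bias-corrected mean squared distance of the $M$ embeddings to their empirical mean, and it can be computed without forming the $D \times D$ covariance matrix.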




\subsection{Behavior of the Embedding under Cross Entropy optimization}
\label{sec:behavior}
\subsubsection{Geometrical behavior}

Consider a deterministic DNN taking an image $\mathbf{x} \in \mathbb{R}^p$ and generating an embedding $\mathbf{z} = h_\theta( \mathbf{x}) \in \mathbb{R}^D$. This embedding is then passed through the classification layer, producing a logit vector $\boldsymbol{\ell} := ( \ell_1, \ldots, \ell_K) \in \mathbb{R}^K$. The logits are then normalized using the Softmax function:
\begin{equation} \label{eq:1}
\forall k \in \llbracket 1,K\rrbracket, \quad \mathbb{P}(y = k | \mathbf{x}) = \frac{\exp \ell_k}{\sum_{i=1}^K \exp\ell_i} \in [0,1].
\end{equation}

For a given sample $( \mathbf{x},y) \in \mathbb{R}^p \times \llbracket 1,K\rrbracket$, the per-sample CE loss is $-\log \mathbb{P}(y|\mathbf{x})$, and the training objective used for backpropagation is its expectation over the training distribution:
\begin{equation} \label{eq:2}
\mathcal{L}_{CE} = \mathbb{E}_{(\mathbf{x},y)\sim \mathbb{P}_{\text{Train}}}[-\log \mathbb{P}(y|\mathbf{x})].
\end{equation}

The Neural Collapse (NC) phenomenon, studied by~\citet{papaye} and shown in Fig.~\ref{fig:neural_collapse}, describes how class embeddings converge to well-separated centroids in the later training stages.


In fact, NC is not an unusual or isolated phenomenon: it has a mathematical basis and is observed consistently in supervised learning. \citet{neuralcollapse1} provide theoretical justification, showing that NC arises as an optimal configuration under cross-entropy minimization. Additionally, \citet{neuralcollapse2} extend this understanding by observing that supervised contrastive loss also leads to similar geometric configurations, indicating that this structure emerges naturally and consistently across different optimization paradigms in deep learning.

Regarding OoD behavior under Neural Collapse, \citet{softmax} demonstrated that when the data exhibits low aleatoric uncertainty and the feature extractor is sufficiently deep, the simplex configuration depicted in Fig.~\ref{nc:step4} is both achievable and optimal for OoD detection using MSP~\citep{msp}, as OoD embedding samples tend to cluster near the origin and around the decision boundaries. This finding is further supported by~\citet{neco}.
\begin{figure}[h!]
  \centering
  \begin{subfigure}[t]{.48\linewidth}
    \centering\includegraphics[width=1\linewidth]{stage1_neural_collapse_crop.pdf}
    \caption{}
    \label{nc:step1}
  \end{subfigure}
  \begin{subfigure}[t]{.48\linewidth}
    \centering\includegraphics[width=1.1\linewidth]{stage2_neural_collapse_crop.pdf}
    \caption{}
    \label{nc:step2}
  \end{subfigure}
  \\
   \begin{subfigure}[t]{.48\linewidth}
    \centering\includegraphics[width=1.05\linewidth]{stage3_neural_collapse_crop.pdf}
    \caption{}
    \label{nc:step3}
  \end{subfigure}
  \begin{subfigure}[t]{.48\linewidth}
    \centering\includegraphics[width=1.02\linewidth]{stage4_neural_collapse_crop.pdf}
    \caption{}
    \label{nc:step4}
  \end{subfigure}
  \caption{
    Illustration of Neural Collapse with the progressive emergence of a simplex configuration from~\ref{nc:step1} to~\ref{nc:step4}.}
    \label{fig:neural_collapse}
\end{figure}
\subsubsection{Analytical behavior}
\label{sec:analytic}
In Eq.~\eqref{eq:1}, each logit $\ell_k$, $k \in \llbracket 1,K\rrbracket$, can be expressed using cosine similarity and the classifier’s weight vectors. Let $\mathbf{w}_k \in \mathbb{R}^D$ denote the $k$-th row of $W_\theta$, written as a column vector, and let $\phi_k = \arccos\left( \frac{\mathbf{w}_k^T \mathbf{z}}{\|{\mathbf{w}_k}\| \|\mathbf{z}\|} \right)$ be the angle between $\mathbf{w}_k$ and $\mathbf{z}$. Then the logit is given by:
\begin{equation} \label{eq:3}
    \ell_k = \mathbf{w}_k^T\mathbf{z} = \|\mathbf{w}_k\|\|\mathbf{z}\|\cos(\phi_k).
\end{equation}

Substituting this into the Softmax probability expression, we have:
\begin{equation} \label{eq:4}
    \mathbb{P}(y = k | \mathbf{x}) = \frac{\exp  (\|\mathbf{w}_k\|\|\mathbf{z}\|\cos(\phi_k)) }{\sum_{i=1}^K  \exp (\|\mathbf{w}_i\|\|\mathbf{z}\|\cos(\phi_i))}.
\end{equation} 
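
Two limiting cases of Eq.~\eqref{eq:4} may help fix ideas. This is only a sketch, in which the angles $\phi_i$ and the weight norms $\|\mathbf{w}_i\|$ are held fixed while the feature norm $\|\mathbf{z}\|$ varies. As $\|\mathbf{z}\| \to 0$, every exponent tends to $0$, hence
\[
    \mathbb{P}(y = k | \mathbf{x}) \;\longrightarrow\; \frac{1}{K} \quad \text{for all } k \in \llbracket 1,K\rrbracket,
\]
so embeddings close to the origin receive a nearly uniform Softmax output. Conversely, if a single class $k^\star$ (a notation introduced here for illustration) strictly maximizes $\|\mathbf{w}_i\|\cos(\phi_i)$ over $i$, then
\[
    \mathbb{P}(y = k^\star | \mathbf{x}) \;\longrightarrow\; 1 \quad \text{as } \|\mathbf{z}\| \to +\infty ,
\]
so a large feature norm alone is enough to drive the maximum Softmax probability toward $1$.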

\citet{calib} hypothesize that the confidence assigned to an input’s most likely class is strongly influenced by the norm of its feature representation. However, because the norm is unconstrained, it may become less sensitive to the difficulty of the input.

\begin{algorithm}[H]
\caption{Normalization of Features}
\begin{algorithmic}
\Function{forward}{$\mathbf{x}$}
    \State $\mathbf{z} \gets h_{\theta}(\mathbf{x})$
    \State featurenorm $\gets \|\mathbf{z}\|$
    \State $\mathbf{z} \gets \frac{\mathbf{z}}{\|\mathbf{z}\|}$
    \State $y \gets g_\theta(\mathbf{z})$
    \State \Return $y$, featurenorm
\EndFunction

\end{algorithmic}  \label{alg:L2_normalisation}
\end{algorithm}

To address this, \citet{L2norm} propose applying $L^2$ normalization to the embedding features $\mathbf{z} = h_\theta(\mathbf{x})$, transforming them into $\frac{\mathbf{z}}{\|\mathbf{z}\|}$ before computing the logits. This step decouples the feature magnitudes from equinormality constraints. By normalizing the embeddings only during training, the method preserves variability in feature norms, allowing them to better capture input-specific difficulty.

Importantly, doing so ensures that the feature norms of OoD samples are much lower than those of ID samples, making them an effective indicator for OoD detection. This work is the foundation of our method. For the reader's convenience, the normalization is presented in Algorithm~\ref{alg:L2_normalisation}.
\section{Stochastic Embedding Dynamics} \label{sec:counter}

As discussed in Section~\ref{sec:related}, adding stochasticity to the final layer alone does not directly provide an effective solution for OoD detection. To explore its potential benefits, we first examine a DNN trained using the CE loss with Dropout on the embedding. 

A notable observation, shown in Fig.~\ref{fig:embedding_var1}, is that applying MC Dropout to the penultimate layer during inference consistently results in $\Var(Z_\text{OoD}) < \Var(Z_\text{ID})$.
This result may seem counterintuitive, as OoD inputs are usually expected to exhibit higher variance.
\begin{figure}[]
    \centering
    \includegraphics[width=1\linewidth]{vit_cifar_cifar100_crop.pdf}
    \caption{Histograms of $\Var(Z_\text{ID})$ (in blue) and $\Var(Z_{\text{OoD}})$ (in red), derived by applying Dropout during the inference phase to generate $Z_\text{ID}$ for ID inputs (ImageNet) and $Z_\text{OoD}$ for OoD inputs (Textures).}
    \label{fig:embedding_var1}
\end{figure}

\subsection{Assumptions}
\label{sec:analysis}

To understand why $\Var(Z_\text{OoD}) < \Var(Z_\text{ID})$ is observed, we analyze the embedding space under specific assumptions.

Considering our trained DNN, the first step is to introduce the set of OoD samples detected by MSP during inference, split according to the feature norm:
\begin{align}
D_{\mathrm{OoD}}^{\mathrm{MSP}} = \Bigl\{ (\mathbf{x},y) \mid \mathbf{z}=h_\theta(\mathbf{x}), \; \|\mathbf{z}\| \le \tau \Bigr\} \nonumber\\ \quad
\cup \Bigl\{ (\mathbf{x},y) \mid \mathbf{z}=h_\theta(\mathbf{x}), \; \|\mathbf{z}\| > \tau \Bigr\}, 
\end{align}
for some $\tau \in \mathbb{R}^*_+$. 
We focus on OoD samples with low feature norms:
\[
D_{\mathrm{OoD}}^{\mathrm{MSP}}(\tau) := \Bigl\{ (\mathbf{x},y) \mid \mathbf{z}=h_\theta(\mathbf{x}), \; \|\mathbf{z}\| \le \tau \Bigr\}.
\]
Let $\varepsilon \in [\varepsilon_{\min},1]$, where
\[
\varepsilon_{\min} = 1 - \max_{\mathbf{x},\, j} \frac{\mathbf{w}_j^\top\, h_\theta(\mathbf{x})}{\|\mathbf{w}_j\|\|h_\theta(\mathbf{x})\|}.
\]
Finally, let us partition $D_{\mathrm{OoD}}^{\mathrm{MSP}}(\tau)$ into
\[
\begin{aligned}
D_{\mathrm{OoD}}^{\mathrm{MSP}}(\tau) 
&= \Bigl\{ (\mathbf{x},y) \mid \exists\, j \in \llbracket 1,K\rrbracket : \cos(\phi_j) \ge 1 - \varepsilon \Bigr\} \\
&\cup \Bigl\{ (\mathbf{x},y) \mid \forall\, j \in\llbracket 1,K\rrbracket: \cos(\phi_j) < 1 - \varepsilon \Bigr\}.
\end{aligned}
\]
For the sake of the theoretical study, suppose that:
\begin{itemize}
    \item the DNN $f_\theta$ is trained using the regular CE loss, and
    \item NC occurs, along with the configurations described by~\citet{softmax}.
\end{itemize}
Then we can safely assume that the set defined in the following Lemma~\ref{lem:1} is non-empty.
\begin{lemma}[See Appendix \ref{subsec:theory_graphics}]
\label{lem:1}
Let $\mathbf{x}_{\mathrm{ID}} \sim \mathbb{P}_{\mathrm{ID}}(\mathbf{x})$ be an ID sample that is correctly classified by MSP as ID and such that there exists $k \in \llbracket 1,K \rrbracket$ with $\cos(\phi_k) = 1$, and define $\mathbf{z}_{\mathrm{ID}} = h_\theta(\mathbf{x}_{\mathrm{ID}})$.
Let $(\mathbf{x}_{\mathrm{OoD}}, \upsilon)$ be an OoD sample such that $\mathbf{z}_{\mathrm{OoD}} = h_\theta(\mathbf{x}_{\mathrm{OoD}})$ and 
\begin{equation}
(\mathbf{x}_{\mathrm{OoD}},\upsilon) \in \Bigl\{ (\mathbf{x},y) \mid \exists\, j \in \llbracket 1,K\rrbracket :\; \cos(\phi_j) \ge 1-\varepsilon \Bigr\}.
\end{equation}

Then, for a suitably chosen $\tau$, we have
\begin{equation}
    \|\mathbf{z}_{\mathrm{OoD}}\| \leq \|\mathbf{z}_{\mathrm{ID}}\|.
\end{equation}
\end{lemma}
\begin{proof}
    We refer the reader to Appendix~\ref{sec:proof}.
\end{proof}
Now, the next step is to incorporate geometric and probabilistic concepts to model the effect of Dropout applied to the embedding during inference, since MC Dropout applied to an embedding $\mathbf{z} = \|\mathbf{z}\| \boldsymbol{\varphi}$ perturbs both its norm and its direction.

We first analyze the scenario where only the directional component is affected, as modeled by Theorem~\ref{thm:theorem43}.
\subsection{Spherical Cap Geometry and its Role in Embedding Dispersion}

\begin{theorem}[see Appendix~\ref{subsec:theory_graphics}]
\label{thm:theorem43}
Let $\mathbf{z} \in \mathbb{R}^D$ be an embedding vector, and write
\begin{equation}
   \mathbf{z} = \|\mathbf{z}\| \boldsymbol{\varphi}, 
\end{equation}
for some fixed unit vector $\boldsymbol{\varphi} \in S^{D-1}$. Let $\Phi \in [0,\pi]$ be given, and define the spherical cap
\begin{equation}
    C_\Phi(\boldsymbol{\varphi}) \;=\; \Bigl\{ \boldsymbol{\alpha} \in S^{D-1} : \boldsymbol{\alpha}^\top \boldsymbol{\varphi} \ge \cos \Phi \Bigr\}.
\end{equation}
For $M \in \mathbb{N}^*$, suppose that $\{\boldsymbol{\alpha}^{(1)}, \boldsymbol{\alpha}^{(2)}, \dots, \boldsymbol{\alpha}^{(M)} \}$ is a sequence of i.i.d. random vectors on $C_\Phi(\boldsymbol{\varphi})$, with the same distribution as a generic random direction $\boldsymbol{\alpha}$.

The concentration parameter $ c = \mathbb{E}\bigl[\boldsymbol{\alpha}^\top\boldsymbol{\varphi}\bigr]$ quantifies how tightly the perturbed directions are distributed around $\boldsymbol{\varphi}$.

We define the matrices:
\begin{equation}
A_M := \left(\boldsymbol{\alpha}^{(1)}, \boldsymbol{\alpha}^{(2)}, \dots, \boldsymbol{\alpha}^{(M)}\right), \qquad Z_M := \|\mathbf{z}\| A_M,
\end{equation}
and we denote by $\Var(Z_M)$ the variance estimator of $Z_M$.
Then the following properties hold:
\begin{enumerate}
    \item \textbf{Finite-Sample Expectation:} For all $M \ge 2$,
    \begin{equation}
         \mathbb{E}\Bigl[\Var(Z_M)\Bigr] = \|\mathbf{z}\|^2 \,\bigl(1-c^2\bigr).
    \end{equation}
    \item \textbf{Almost-Sure Asymptotics:} As $M \to \infty$,
   \begin{equation}
       \Var(Z_M) \;\overset{\text{a.s.}}{\longrightarrow}\; \|\mathbf{z}\|^2 \,\bigl(1-c^2\bigr).
   \end{equation} 

\end{enumerate}
\end{theorem}



\begin{proof}
We refer the reader to Appendix~\ref{sec:proof}.     
\end{proof}
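
The first identity can be sketched in a few lines. The sketch assumes, as in the simplified model discussed below, that the $\boldsymbol{\alpha}^{(m)}$ are uniformly distributed on $C_\Phi(\boldsymbol{\varphi})$ (more generally, that their distribution is invariant under orthogonal transformations fixing $\boldsymbol{\varphi}$), and that $M \ge 2$ so that the estimator is well defined. It is meant as an illustration and does not replace the proof in the appendix. Since $\Var(Z_M)$ is the trace of the unbiased sample covariance of the i.i.d. vectors $\|\mathbf{z}\|\boldsymbol{\alpha}^{(m)}$, and $\|\boldsymbol{\alpha}\| = 1$,
\begin{align*}
    \mathbb{E}\bigl[\Var(Z_M)\bigr]
    &= \Tr\bigl(\operatorname{Cov}(\|\mathbf{z}\|\boldsymbol{\alpha})\bigr) \\
    &= \|\mathbf{z}\|^2 \bigl( \mathbb{E}\bigl[\|\boldsymbol{\alpha}\|^2\bigr] - \bigl\|\mathbb{E}[\boldsymbol{\alpha}]\bigr\|^2 \bigr) \\
    &= \|\mathbf{z}\|^2 \bigl( 1 - \bigl\|\mathbb{E}[\boldsymbol{\alpha}]\bigr\|^2 \bigr).
\end{align*}
By the symmetry assumption, $\mathbb{E}[\boldsymbol{\alpha}]$ is collinear with $\boldsymbol{\varphi}$, so that $\mathbb{E}[\boldsymbol{\alpha}] = \mathbb{E}\bigl[\boldsymbol{\alpha}^\top\boldsymbol{\varphi}\bigr]\,\boldsymbol{\varphi} = c\,\boldsymbol{\varphi}$ and $\bigl\|\mathbb{E}[\boldsymbol{\alpha}]\bigr\|^2 = c^2$, which gives $\|\mathbf{z}\|^2(1-c^2)$.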



%\textcolor{brown}{The parameter $c^2 \in [0,1]$ depends on the angle $\Phi$ and $D$, its explicit computation is made in Appendix \ref{sec:comput_c}.} 

To interpret the role of $c^2$, note that when $M \to +\infty$ and $c^2 = 0$, the perturbed directions are uniformly distributed over the hypersphere. Conversely, if $c^2 = 1$, the perturbations are fully concentrated around $\boldsymbol{\varphi}$.

For simplicity, we assume that the i.i.d. random vectors are uniformly distributed over $C_\Phi(\boldsymbol{\varphi})$, i.e., locally uniform over the hypersphere. This assumption may be relaxed by adopting any relevant statistical model supported on the hypersphere with a well-defined variance.


Now, we have all the necessary components to explain the observation in Fig.~\ref{fig:embedding_var1}.
\subsection{How Feature Norms Amplify Variance in ID Data}
Let $\mathbf{x}_{\text{ID}} \sim \mathbb{P}_{\text{Test}}(\mathbf{x})$ and $\mathbf{x}_{\text{OoD}} \sim \mathbb{P}_{\text{OoD}}(\mathbf{x})$ be samples whose original embeddings, denoted by $\mathbf{z}_{\text{ID, orig}}$ and $\mathbf{z}_{\text{OoD, orig}}$, satisfy the conditions of Lemma~\ref{lem:1}, so that $\|\mathbf{z}_{\text{OoD, orig}}\| \leq \|\mathbf{z}_{\text{ID, orig}}\|$.

Under the model of Theorem~\ref{thm:theorem43} where only the direction is perturbed and the original norm is fixed, we have:
\begin{align}
    \mathbb{E}[\Var(Z_\text{ID})] &= \|\mathbf{z}_{\text{ID, orig}}\|^2(1-c_{\text{ID}}^2), \label{eq:var_id_ideal} \\
    \mathbb{E}[\Var(Z_\text{OoD})] &= \|\mathbf{z}_{\text{OoD, orig}}\|^2(1-c_{\text{OoD}}^2). \label{eq:var_ood_ideal}
\end{align}
Assuming comparable angular dispersion, i.e., $c_{\text{ID}}^2 =c_{\text{OoD}}^2$ (standard training does not differentiate this aspect for the considered samples), Lemma~\ref{lem:1} implies
\begin{equation}
\mathbb{E}[\Var(Z_\text{OoD})] \leq \mathbb{E}[\Var(Z_\text{ID})].
\end{equation}
We now extend this to the general case where MC Dropout perturbs both the norm and direction.
Let $\mathbf{z}'$ be an embedding after application of the Dropout mask, such that $\mathbf{z}' = s\boldsymbol{\alpha} \|\mathbf{z}_{\text{orig}}\|$, where $s \in [0,1]$ is a stochastic norm scaling factor and $\boldsymbol{\alpha}$ is the stochastic unit direction. We assume the following:
\begin{enumerate}
    \item \textbf{Independent Norm Scaling:} The random variable $S$ (of which $s$ is a realization) is independent of the original norm and has the same distribution for ID and OoD samples. Let $\kappa = \mathbb{E}[S^2]$, where $0 < \kappa \le 1$. Thus, the average squared post-Dropout norm is $\mathbb{E}_{\text{masks}}[\|\mathbf{z}'\|^2] = \mathbb{E}[S^2 \|\mathbf{z}_{\text{orig}}\|^2] = \kappa \|\mathbf{z}_{\text{orig}}\|^2$.
    \item \textbf{Decoupled Uniform Perturbations:} Although $s$ and $\boldsymbol{\alpha}$ both arise from the same Dropout mask, we assume, as an approximation, that the variance structure from Theorem~\ref{thm:theorem43} still applies once the fixed $\|\mathbf{z}\|^2$ is replaced with $\mathbb{E}_{\text{masks}}[\|\mathbf{z}'\|^2]$.
\end{enumerate}
Under these conditions, the average post-Dropout norms are:
\begin{align}
    \mathbb{E}_{\text{masks}}[\|\mathbf{z}'_{\text{ID}}\|^2] &= \kappa \|\mathbf{z}_{\text{ID, orig}}\|^2, \\
    \mathbb{E}_{\text{masks}}[\|\mathbf{z}'_{\text{OoD}}\|^2] &= \kappa \|\mathbf{z}_{\text{OoD, orig}}\|^2.
\end{align}
Since $\|\mathbf{z}_{\text{OoD, orig}}\|^2 \leq \|\mathbf{z}_{\text{ID, orig}}\|^2$ and $\kappa > 0$, it follows that $\mathbb{E}_{\text{masks}}[\|\mathbf{z}'_{\text{OoD}}\|^2] \leq \mathbb{E}_{\text{masks}}[\|\mathbf{z}'_{\text{ID}}\|^2]$.
If the angular concentration parameters $c_{\text{ID}}^2$ and $c_{\text{OoD}}^2$ remain comparable, i.e., $c_{\text{ID}}^2 = c_{\text{OoD}}^2$, then:
\begin{align}
    \mathbb{E}[\Var(Z_\text{ID})] &= \kappa \|\mathbf{z}_{\text{ID, orig}}\|^2 (1-c_{\text{ID}}^2), \\
    \mathbb{E}[\Var(Z_\text{OoD})] &= \kappa \|\mathbf{z}_{\text{OoD, orig}}\|^2 (1-c_{\text{OoD}}^2).
\end{align}
This leads to $\mathbb{E}[\Var(Z_\text{OoD})] \leq \mathbb{E}[\Var(Z_\text{ID})]$. 
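
To make the orders of magnitude concrete, consider a purely hypothetical example, with values chosen for illustration only and not measured on any model: $\|\mathbf{z}_{\text{ID, orig}}\| = 10$, $\|\mathbf{z}_{\text{OoD, orig}}\| = 4$, $c_{\text{ID}}^2 = c_{\text{OoD}}^2 = 0.9$, and $\kappa = 0.5$. The two expressions above then give
\begin{align*}
    \mathbb{E}[\Var(Z_\text{ID})] &= 0.5 \times 10^2 \times (1 - 0.9) = 5, \\
    \mathbb{E}[\Var(Z_\text{OoD})] &= 0.5 \times 4^2 \times (1 - 0.9) = 0.8 .
\end{align*}
The ratio of the two expected variances, $5 / 0.8 = 6.25$, equals the ratio of the squared norms, $100/16$, whatever the values of $\kappa$ and of the common concentration $c^2 < 1$. Under this model, the variance gap is therefore driven entirely by the feature norms.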

 While our model, particularly the decoupling approximation,  simplifies the complex interaction of norm and directional perturbations from Dropout, it provides a rationale for the observed variance difference. The inherent dual impact of Dropout nonetheless complicates general ID/OoD separation, as suggested by phenomena in Fig.~\ref{fig:embedding_var1} and Fig.~\ref{fig:toy_separation}.

In conclusion, our analysis suggests that while MC Dropout introduces stochasticity into the embeddings, it does so in an uncontrolled way by perturbing both the norm and the direction simultaneously. This mixing of effects leads to the observed higher variance for ID samples, primarily because of their larger norms for the class of samples considered.
\section{Norm and angular decoupling}

It might be tempting to optimize the variance difference, i.e., to force $\Var(Z_\text{OoD}) \ll \Var(Z_\text{ID})$, as a means to distinguish between ID and OoD data. \textbf{Instead}, our strategy does not rely on enhancing such variance differences but rather decouples the norm from the angular component of the embedding vector:
\begin{enumerate}
  \item We impose some constraints on the angular concentration, a parameter that remained unconstrained in the standard Dropout setup described in Sec.~\ref{sec:counter}. 
    To achieve this, we first add a fully-connected DropConnect layer after the embedding~\citep{wan2013regularization}, then apply normalization, thereby leveraging the Central Limit Theorem and concentration of measure phenomena.
    \item A fortunate byproduct of this design is that we naturally integrate the $L^2$ normalization strategy from~\citet{L2norm}, as detailed in Algorithm~\ref{alg:L2_normalisation}.
 
\end{enumerate}


\subsection{Training phase}

\begin{algorithm}[h]
\caption{Training Phase}
\begin{algorithmic}[1] 
\State \textbf{Input (Training):} Train input $\mathbf{x}$, Feature extractor $h_\theta(\cdot)$, DropConnect function $\text{DC}(\cdot)$, Classifier $g_\theta(\cdot)$
\For{each batch of data $\mathbf{x}$}
    \State $\mathbf{z} \gets h_\theta(\mathbf{x})$
    \State $r \gets \|\mathbf{z}\|$
    \State $\boldsymbol{\alpha} \gets \mathbf{z}/r $
    \State $\boldsymbol{\alpha}_{DC} \gets \text{DC}(\boldsymbol{\alpha})$
    \State $\boldsymbol{\ell} \gets g_\theta(\frac{\boldsymbol{\alpha}_{DC}}{\|\boldsymbol{\alpha}_{DC}\|})$
    \State Compute loss $\mathcal{L}$ using $\boldsymbol{\ell}$ and labels
    \State Back-propagate to update network weights $\theta$
\EndFor
\end{algorithmic} \label{alg:alg_train}
\end{algorithm}

During the training step, each input $\mathbf{x}$ is passed through the feature extractor \( h_\theta \) to obtain an embedding vector $\mathbf{z}$. To introduce random rotations, a fully-connected stochastic linear layer $\text{DC}(\cdot)$, which uses DropConnect and matches the dimensionality of $\mathbf{z}$, is added after the embedding layer.


The normalized embedding is passed through the DropConnect function $\text{DC}(\cdot)$, producing a stochastically perturbed vector $\boldsymbol{\alpha}_{DC}$.
Since the output is not guaranteed to be a unit vector, we normalize it again. The DC layer, combined with normalization, stretches, distorts, and projects the vector ${\boldsymbol{\alpha}}$ onto the hypersphere.

Using DropConnect means that the fully connected layer $\text{DC} : \mathbb{R}^D \to \mathbb{R}^D$ outputs each component $({\boldsymbol{\alpha}}_{DC})_i$ of $\boldsymbol{\alpha}_{DC}$ as a sum of many independent and uniformly bounded contributions, each multiplied by a Bernoulli random variable. Consequently, by the Central Limit Theorem under Lindeberg's condition~\citep{lindeberg1922}, each component of
$\boldsymbol{\alpha}_{DC}$ asymptotically satisfies:
\begin{equation}
 \forall i \in \llbracket1,D \rrbracket,  \sqrt{D}\,({\boldsymbol{\alpha}}_{DC})_i \xrightarrow{d} \mathcal{N}(\delta_i,\sigma_i^2) 
\end{equation}
as $D \to +\infty$, where $\xrightarrow{d}$ denotes convergence in distribution. Since the components are independent, the entire random vector $\boldsymbol{\alpha}_{DC}$ is asymptotically Gaussian. Consequently, $\boldsymbol{\alpha}_{DC}$ behaves as a Gaussian random vector with diagonal covariance matrix, and the normalized version $\frac{\boldsymbol{\alpha}_{DC}}{\| \boldsymbol{\alpha}_{DC}\|}$ is distributed over a spherical cap. To simplify the presentation, we keep assuming that the normalized vector
is uniformly distributed over a spherical cap. A more precise study of this statistical model with its true distribution is provided in Appendix~\ref{sec:concentration_ineq}.

Training in this way creates meaningful angular differences between ID and OoD data. Indeed, exposing ID data to angular perturbations during training refines the network, making the model invariant to the specific angular perturbations introduced by DropConnect and effectively confining ID inputs within a smaller spherical cap (i.e., $c_{\text{ID}}^2 \simeq 1$), as illustrated in Appendix, Fig.~\ref{fig:capID_OoD}, and observed in Fig.~\ref{fig:p_expect}.

\subsection{Inference phase}

\begin{algorithm}[h!]
\caption{Inference Phase and OoD Detection}
\begin{algorithmic}[1]  

\State \textbf{Input:} Test input $\mathbf{x}$, Feature extractor $h_\theta(\cdot)$, DropConnect function $\text{DC}(\cdot)$, Classifier $g_\theta(\cdot)$, Number of forward passes $M$
\State $\mathbf{z} \gets h_\theta(\mathbf{x})$
\State $r \gets \|\mathbf{z}\|$
\State $\boldsymbol{\alpha} \gets \mathbf{z}/r$
\For {$m = 1$ \textbf{to} $M$}
    \State $\boldsymbol{\alpha}_{DC}^{(m)} \gets \text{DC}(\boldsymbol{\alpha})$
\State $\boldsymbol{\alpha}_{DC}^{(m)} \gets \frac{\boldsymbol{\alpha}_{DC}^{(m)}}{\|\boldsymbol{\alpha}_{DC}^{(m)}\|}$
\EndFor
\State On a validation set, compute $\bar{r} = \frac{1}{IQR \times N}\sum_{i=1}^N r_i$, where $r_i$ is the norm of the $i$-th validation embedding and $IQR$ is the interquartile range of these norms.
\State Define the OoD score as $S_{DC}(\mathbf{x}) := \mathrm{Var}(A_{DC}) + \lambda\frac{\bar{r} -r}{r}$, where $A_{DC} = (\boldsymbol{\alpha}_{DC}^{(1)},\ldots,\boldsymbol{\alpha}_{DC}^{(M)})$.
\State \textbf{Output:} OoD Score $S_{DC}(\mathbf{x})$
\end{algorithmic}\label{alg:alg_test}
\end{algorithm}
At inference time, we keep the DropConnect stochasticity active by applying the DC layer immediately after the embedding layer, followed by normalization, using the same DropConnect rate as during training.
This results in $M$ different perturbations of the angle, $\boldsymbol{\alpha}_{DC}^{(m)}$ for $m \in \llbracket 1,M\rrbracket$, all associated with the same norm $\|\mathbf{z}\|$, which is held constant across perturbations.

After completing the $M$ passes, the variance of the matrix $A_{DC} = (\boldsymbol{\alpha}_{DC}^{(1)},\ldots, \boldsymbol{\alpha}_{DC}^{(M)})$ is calculated. With $\mathbf{z} = h_\theta(\mathbf{x})$ and $r = \|\mathbf{z}\|$, we define the score as
\begin{equation} S_{DC}(\mathbf{x}) := \Var(A_{DC}) + \lambda\frac{\bar{r} - r}{r},
\end{equation}
where $\bar{r}$ is computed as the mean norm divided by the interquartile range (IQR) of the norms on a validation set with $IQR = Q(0.75) - Q(0.25)$, and $Q(p)$ is the $p$-th quantile. 

Dividing by the IQR makes the norm score robust to outliers and provides a consistent scaling factor that reflects both the central tendency and variability of the validation set.
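
As a purely illustrative calculation of the score, consider the following hypothetical values, which are chosen for readability and do not come from our experiments: validation norms with mean $12$ and $IQR = 4$, so that $\bar{r} = 12/4 = 3$, and $\lambda = 0.1$. For an ID-like input with small angular variance and large norm, and an OoD-like input with larger angular variance and small norm, the score evaluates to
\begin{align*}
\text{ID-like: } & \mathrm{Var}(A_{DC}) = 0.02,\ r = 6 &&\Rightarrow\quad S_{DC} = 0.02 + 0.1 \cdot \frac{3 - 6}{6} = -0.03, \\
\text{OoD-like: } & \mathrm{Var}(A_{DC}) = 0.08,\ r = 2 &&\Rightarrow\quad S_{DC} = 0.08 + 0.1 \cdot \frac{3 - 2}{2} = 0.13.
\end{align*}
In this toy setting both terms push the score in the same direction, so a larger $S_{DC}$ indicates a more OoD-like input.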

The hyperparameter $\lambda$ can be chosen, for instance, as the $90$th percentile (or another appropriate quantile) of the score distribution computed on ID data from the validation set.

Note that using only $\Var(Z)$ to separate ID from OoD led to poor and unstable performance, likely due to a mismatch in the optimization objective. Indeed, $\Var(Z_{\text{ID}}) = r^2\Var(A_{\text{ID}})$, and while $r^2$ increases for ID data, $\Var(A_{\text{ID}})$ decreases. The opposite occurs for $\Var(Z_{\text{OoD}})$.
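
The factorization of $\mathrm{Var}(Z)$ follows directly from holding the norm fixed across passes. As a sketch, assume (this convention is adopted here only for illustration) that $\mathrm{Var}(\cdot)$ denotes the trace of the empirical covariance over the $M$ passes, and write $\mathbf{z}^{(m)} = r\,\boldsymbol{\alpha}_{DC}^{(m)}$ with empirical means $\bar{\mathbf{z}}$ and $\bar{\boldsymbol{\alpha}}$:
\begin{equation*}
\mathrm{Var}(Z) = \frac{1}{M}\sum_{m=1}^{M} \big\| \mathbf{z}^{(m)} - \bar{\mathbf{z}} \big\|^2
= r^2 \, \frac{1}{M}\sum_{m=1}^{M} \big\| \boldsymbol{\alpha}_{DC}^{(m)} - \bar{\boldsymbol{\alpha}} \big\|^2
= r^2\, \mathrm{Var}(A).
\end{equation*}
With hypothetical values $r = 10$, $\mathrm{Var}(A) = 0.01$ for an ID input and $r = 2$, $\mathrm{Var}(A) = 0.2$ for an OoD input, the products are $1.0$ and $0.8$ respectively, so the factor $r^2$ can reverse the ordering given by the angular variance alone.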
\section{Experiments}
We applied a high DropConnect rate on the linear DropConnect layer, with empirical results showing \textbf{that $p \in [0.8, 0.9]$ yields optimal performance}.

During the inference phase, we used $M=50$ forward passes to compute $\Var(Z)$. While this number may initially appear low given the size of the embedding space, working on the unit hypersphere allows us to benefit from the blessing of dimensionality, namely the concentration of measure in high-dimensional spaces, which ensures that even a moderate number of passes provides a reliable estimate of the variance, as shown in Appendices~\ref{app:dropconnect} and~\ref{sec:concentration_ineq}.
\subsection{Effect of DropConnect on the hypersphere}
\label{sec:concentration}

\begin{figure}[H]
    \centering
    \includegraphics[width=1\linewidth]{pexpec-cropped.pdf}
\caption{Evolution of $1-c^2$ as $p \to 1$.}
    \label{fig:p_expect}
\end{figure}
To derive Fig.~\ref{fig:p_expect}, we trained the same model using various DropConnect rates $p \in \{0.1, \ldots, 0.9\}$. For each fixed $p$ and for every test input $\mathbf{x} \sim \mathbb{P}_{\text{Test}}(\mathbf{x})$, we performed $M = 50$ forward passes without rescaling by the norm, thereby obtaining a $1-c^2$ value per input. We then computed the mean and standard deviation of these values across the entire test and OoD datasets. The error bars indicate the standard deviation. Finally, we plotted these estimates as a function of $p$.
As $p \to 1$, the separation between the angular concentration of ID and OoD data becomes clear and statistically significant (e.g., for $p = 0.8$ and $p = 0.9$). We also observe that all test ID inputs exhibit tightly concentrated $c_{\text{ID}}^2$ values, consistent with the concentration of measure in high dimensions. In contrast, the $c_{\text{OoD}}^2$ values for OoD inputs show significantly more dispersion, indicating a \textbf{weaker} concentration phenomenon and suggesting that the angular response of the model to OoD data is more variable and less predictable.
\subsection{OoD detection}
We built our OoD benchmark around embedding-based, post-hoc detectors chosen for their ease of integration and proven effectiveness. These techniques are prevalent in the OoD community and serve as a solid foundation for our tests, with several recognized as state-of-the-art on large-scale architectures.
To evaluate the effectiveness of these methods, we report three metrics: the area under the ROC curve (AUROC), the area under the precision-recall curve (AUPRC), and the false positive rate at 95\% true positive rate (FPR95). As shown in Tables~\ref{tab:cifar100_ood}, \ref{tab:cifar10_ood}, and \ref{tab:imagenet_all_ood}, our composite score ranks among the top three methods in AUROC and AUPRC on every benchmark, demonstrating robustness and generality across datasets, although its FPR95 ranking is less consistent. It is worth noting that DeepKNN~\citep{deepknn} often tops these benchmarks, largely because it normalizes every embedding before performing the $k$-nearest-neighbor search. That norm-based separation amplifies the gap between ID and OoD points, probably giving DeepKNN an edge. Indeed, \citet{sun2021react, azizmalayeri2024mitigating} observed that OoD activations tend to be sparser than ID activations, so normalizing these vectors may make them appear even more sparse on the unit sphere. The increased sparsity of OoD activations, once passed through the DC layer, may likewise explain why our method naturally amplifies the angular-variance term for OoD inputs.

\begin{table}[H]
  \centering\small
  \caption{OoD detection on CIFAR-100 (ID) $\to$ SVHN and CIFAR-10 (OoD) using ResNet-18.}
  \label{tab:cifar100_ood}
  %--- SVHN ---
  \textbf{SVHN}\\
  \begin{tabular}{l ccc}
    \toprule
    \textbf{Method}      & \textbf{AUROC}     & \textbf{AUPRC}       & \textbf{FPR@95TPR}  \\
    \midrule
    MSP                  & 84.69              & 86.67                & 57.28               \\
    MaxLogit             & 83.57              & 86.96                & 76.96               \\
    ReAct                & 83.12              & 83.33                & 57.01    \\
    Energy Score         & 86.51              & 57.79                & 98.40               \\
    ASH B                & 84.09              & 83.01                & 58.00               \\
    ASH P                & 84.22              & 81.88                & 56.11               \\
    ASH S                & 88.01              & 83.10                & 54.99               \\
    DeepKNN              & \underline{93.61}  & \textbf{94.15}       & 52.43               \\
    DDU                  & 80.63              & 57.09                & 94.57               \\
    Norm  (Feature)                & 86.95              & 91.73                & \underline{50.69}   \\
    ViM                  & 92.81              & 92.66                & \textbf{49.67}      \\
    Mahalanobis          & 82.21              & 83.03                & 91.01               \\
    Naive Sampling       & 72.33              & 76.27                & 60.28               \\
    LogitNorm            & 82.27              & 59.66                & 86.45               \\
    Ours                 & \textbf{94.23}     & \underline{93.85}    & 65.23               \\
    \bottomrule
  \end{tabular}
    \centering\small

  %--- CIFAR-10 ---
  \textbf{CIFAR-10}\\
  \begin{tabular}{l ccc}
    \toprule
    \textbf{Method}      & \textbf{AUROC}     & \textbf{AUPRC}       & \textbf{FPR@95TPR}  \\
    \midrule
    MSP                  & 76.75              & 81.21                & 73.87               \\
    MaxLogit             & 73.79              & 80.14                & 90.94               \\
    
    ReAct                & 76.44              & 82.11                & 75.44               \\
    ASH B                & 72.20              & 81.01                & 74.17               \\
    ASH P                & \textbf{77.90}     & 82.84                & 76.04               \\
    ASH S                & 71.93              & 82.11                & 75.14               \\
    DeepKNN              & \underline{77.66}  & \textbf{83.15}       & 71.91               \\
    Energy Score         & 75.51              & 80.44                & 97.80               \\
    DDU                  & 74.74              & 80.68                & 97.96               \\
    Norm    (Feature)              & 76.37              & 80.76                & \textbf{70.01}      \\
    ViM                  & 76.01              & 82.19                & 81.33               \\
    Mahalanobis          & 73.92              & 82.03                & 85.01               \\
    Naive Sampling       & 69.24              & 77.44                & 76.30               \\
    LogitNorm            & 74.78              & 79.49                & 73.03               \\
    Ours                 & 77.27              & \underline{82.92}    & \underline{70.46}   \\
    \bottomrule
  \end{tabular}

\end{table}
\section{Conclusion}
We present an exploratory study and a mathematically grounded method for enhancing OoD detection. This work focuses on exploratory analysis and modeling to explain the geometric and probabilistic phenomena observed in embedding spaces. Our exploration revealed that when applying MC Dropout to the embedding layer, ID samples tended to exhibit higher variance than OoD samples primarily due to their larger feature norms. This observation highlighted a critical limitation: MC Dropout affects both norm and angle in an uncontrolled manner, which obscures the true uncertainty signal needed to differentiate between ID and OoD data. 
By establishing a link between uncertainty and concentration of measure, our OoD score integrates controlled angular variance using DropConnect and a norm-based component, leveraging both directional and magnitude information in the embeddings. We hope this connection will offer useful insights and stimulate further interest.
\begin{table}[H]
  \centering\small
  \caption{OoD detection on CIFAR-10 (ID) $\to$ SVHN and CIFAR-100 (OoD) using ResNet-18.}
  \label{tab:cifar10_ood}
  %--- SVHN ---
  \textbf{SVHN}  \\
  \begin{tabular}{l ccc}
    \toprule
    \textbf{Method}      & \textbf{AUROC}     & \textbf{AUPRC}       & \textbf{FPR@95TPR}  \\
    \midrule
    MSP                  & 87.17              & 92.59                & 38.02               \\
    MaxLogit             & 90.70              & 95.41                & 45.84               \\
    Energy Score         & 90.94              & 52.46                & 99.78               \\
    ReAct                & 87.57              & 92.22                & 44.02               \\
    
    ASH B                & 79.44              & 84.01                & 63.01               \\
    ASH P                & 83.99              & 89.12                & 54.56               \\
    ASH S                & 82.01              & 92.11                & 49.03               \\
    DeepKNN              & \underline{95.19}  & 97.26                & \textbf{9.83}       \\
    DDU                  & 84.09              & 56.70                & 87.85               \\
    Norm    (Feature)              & 94.89              & 97.68                & 24.22               \\
    ViM                  & 95.17              & \textbf{98.68}       & 21.05               \\
    Mahalanobis          & 88.45              & 67.34                & 79.12               \\
    Naive Sampling       & 81.11              & 83.55                & 88.11               \\
    LogitNorm            & 93.05              & 70.66                & 80.45               \\
    Ours                 & \textbf{95.37}     & \underline{98.52}    & \underline{18.60}   \\
    \bottomrule
  \end{tabular}
  \centering\small
  %--- CIFAR-100 ---
  \textbf{CIFAR-100}\\
  \begin{tabular}{l ccc}
    \toprule
    \textbf{Method}      & \textbf{AUROC}     & \textbf{AUPRC}       & \textbf{FPR@95TPR}  \\
    \midrule
    MSP                  & 80.62              & 77.54                & 72.62               \\
    MaxLogit             & 75.90              & 76.65                & 78.34               \\
    Energy Score         & 75.93              & 64.18                & 99.47               \\

    ReAct                & 81.77              & 77.19                & 72.12               \\
    ASH B                & 74.11              & 71.01                & 72.21               \\
    ASH P                & 85.99              & 86.12                & 64.56               \\
    ASH S                & 81.01              & 82.11                & 66.99               \\
    DeepKNN              & \textbf{88.51}     & \textbf{86.30}       & \underline{40.33}   \\
    DDU                  & 83.55              & 66.49                & 98.74               \\
    Norm (Feature)                & 87.98              & 86.09                & \textbf{40.01}      \\
    ViM                  & 87.52              & 85.68                & 50.05               \\
    Mahalanobis          & 84.79              & 71.44                & 91.01               \\
    Naive Sampling       & 77.10              & 68.11                & 91.43               \\
    LogitNorm            & 82.78              & 63.49                & 80.03               \\
    Ours                 & \underline{88.01}  & \underline{86.27}    & 47.77               \\
    \bottomrule
  \end{tabular}
\end{table}
\begin{table}[H]
  \centering\small
  \caption{OoD detection on ImageNet (ID) vs.\ three OoD sets (NINCO, Textures, Places365) using ResNet-50.}
  \label{tab:imagenet_all_ood}
  %--- NINCO ---
  \textbf{NINCO}\\
  \begin{tabular}{l ccc}
    \toprule
    \textbf{Method}    & \textbf{AUROC}   & \textbf{AUPRC}    & \textbf{FPR@95TPR} \\
    \midrule
    MSP                & 83.20            & 58.87             & 67.79              \\
    MaxLogit           & 86.67            & 64.52             & 52.85              \\
    Energy Score       & 81.85            & 61.01             & 99.82              \\
    ReAct              & 81.61            & 48.19             & 73.11              \\
    ASH-P              & 78.54            & 55.78             & 66.54              \\
    ASH-B              & 91.04            & 74.04             & 55.67              \\
    ASH-S              & 88.56            & \textbf{79.11}    & 44.11              \\
    Norm(Feature)      & 87.49            & 69.37             & 40.87              \\
    DeepKNN            & \textbf{93.80}   & \underline{77.12} & \textbf{14.06}     \\
    ViM                & 92.14            & 73.56             & 25.21              \\
    Mahalanobis     & 85.23 & 71.83 & 49.36 \\
    DDU &           83.12 & 67.93 & 41.22 \\
    Naive Sampling & 79.45 & 59.74 & 55.09\\
    LogitNorm & 92.22 & 71.59 & 22.30 \\
    Ours               & \underline{93.61}& 76.19             & \underline{21.03}  \\
    \bottomrule
  \end{tabular}
  \vspace{1ex}
  %--- Textures ---
  \textbf{Textures}\\
  \begin{tabular}{l ccc}
    \toprule
    \textbf{Method}    & \textbf{AUROC}   & \textbf{AUPRC}    & \textbf{FPR@95TPR} \\
    \midrule
    MSP                & 69.32            & 60.59             & 85.30              \\
    MaxLogit           & 75.81            & 64.92             & 83.14              \\
    Energy Score       & 27.11            & 52.27             & 99.75              \\
    ReAct              & 74.12            & 64.91             & 90.12              \\
    ASH-P              & 83.42            & 70.32             & 85.46              \\
    ASH-B              & 65.24            & 59.73             & 99.53              \\
    ASH-S              & 79.93            & 66.98             & 77.22              \\
    Norm(Feature)      & 80.79            & 65.72             & 76.14              \\
    DeepKNN            & \textbf{85.06}   & \underline{73.97} & 62.58              \\
    ViM                & 84.55            & 72.09             & \textbf{60.12}     \\
    Mahalanobis        & 77.02            & 58.13             & 93.84              \\
    DDU                & 79.33            & 71.36             & 81.22              \\
    Naive Sampling     & 71.74            & 58.22             & 65.27              \\
    LogitNorm          & 84.88            & 69.46             & \underline{62.00}  \\
    Ours               & \underline{84.97}& \textbf{74.10}    & 68.11              \\
    \bottomrule
  \end{tabular}

  \vspace{1ex}
  %--- Places365 ---
  \textbf{Places365}\\
  \begin{tabular}{l ccc}
    \toprule
    \textbf{Method}    & \textbf{AUROC}   & \textbf{AUPRC}    & \textbf{FPR@95TPR} \\
    \midrule
    MSP                & 73.58            & 85.40             & 79.86              \\
    MaxLogit           & 75.68            & 86.47             & 78.67              \\
    Energy Score       & 66.30            & 82.23             & 98.46              \\
    ReAct              & 75.11            & 84.89             & 91.12              \\
    ASH-P              & 85.08            & \underline{91.01} & 79.02              \\
    ASH-B              & 77.10            & 81.71             & 85.10              \\
    ASH-S              & 79.56            & 88.96             & 81.03              \\
    Norm(Feature)      & 82.46            & 89.56             & 76.45              \\
    DeepKNN            & 84.41            & 89.28             & 60.33              \\
    ViM                & \textbf{85.97}   & 88.56             & \underline{41.39}  \\
    Mahalanobis        & 75.18            & 84.20             & 80.70              \\
    DDU                & 73.44            & 80.36             & 92.39              \\
    Naive Sampling     & 69.40            & 79.95             & 89.32              \\
    LogitNorm          & 84.82            & 89.28             & \textbf{38.26}     \\
    Ours               & \underline{85.10}& \textbf{91.23}    & 64.10              \\
    \bottomrule
  \end{tabular}
\end{table}



\newpage

\bibliography{my}

\newpage

\onecolumn

\title{Stochastic Embeddings: A Probabilistic and Geometric Analysis of Out-of-Distribution Behavior}
\maketitle
\appendix
\section{Visualization}
\label{subsec:theory_graphics}
Fig.~\ref{fig:theory} illustrates the setting considered in Lemma~\ref{lem:1} and the spherical cap defined in Theorem~\ref{thm:theorem43}. In particular, Fig.~\ref{fig:theory}(a) shows the simplex configuration of the embeddings, where OoD embeddings lie around the origin with high cosine similarity and ID data are clustered along the class vectors, while Fig.~\ref{fig:theory}(b) depicts the spherical cap centered around the original embedding direction $\boldsymbol{\varphi}$, within which sampling is performed uniformly.

In Fig.~\ref{fig:capID_OoD}, the blue arrow in the left (resp.\ right) panel represents the initial direction $\boldsymbol{\varphi}$ of the embedding $\mathbf{z} = \|\mathbf{z}\| \boldsymbol{\varphi}$ of an input $\mathbf{x} \sim \mathbb{P}_{\text{Test}}(\mathbf{x})$ (resp.\ $\mathbb{P}_{\text{OoD}}(\mathbf{x})$). During inference, the green (resp.\ purple) arrows represent the $M$ perturbed vectors induced by the $M$ stochastic forward passes, and our method computes the variance over these vectors. In both panels, $\Phi$ marks the limits of the spherical cap. As the illustration shows, ID embeddings exhibit greater stability under stochastic perturbation than OoD embeddings, i.e., $c^2_{\text{OoD}} \leq c^2_{\text{ID}}$.

Fig.~\ref{fig:toy_separation} indicates a high correlation between the norm separation of the embeddings (see Fig.~\ref{subfig:toy_norm_visu}) and the variance separation (see Fig.~\ref{subfig:toy_var_visu}) under MC Dropout, as studied in Sec.~\ref{sec:counter}.
\begin{figure}[h]
  \centering
  \begin{subfigure}[t]{0.45\linewidth}
    \centering
    \includegraphics[width=\linewidth]{ood.pdf}
    \caption{Illustration for Lemma~\ref{lem:1}.}
    \label{fig:theory_a}
  \end{subfigure}
  \begin{subfigure}[t]{0.41\linewidth}
    \centering
    \includegraphics[width=\linewidth]{cap_perturb.pdf}
    \caption{Illustration of the 2D spherical cap for Theorem~\ref{thm:theorem43}.}
    \label{fig:theory_b}
  \end{subfigure}
  \caption{(a) Simplex configuration where OoD embedding lies around the origin with high cosine similarity and ID data are clustered along the class vector. (b) Spherical cap defined in Theorem~\ref{thm:theorem43}.}
  \label{fig:theory}
\end{figure}

\begin{figure}[H]
    \centering
    \includegraphics[width=0.8\linewidth]{dual_caps_correct_phi_labels.pdf}
    \caption{Illustration of the post-training behavior: ID data exhibit more concentration as the DNN remains invariant to stochastic-induced perturbations. }
    \label{fig:capID_OoD}
\end{figure}


\begin{figure}[H]
    \centering
    \begin{subfigure}[t]{0.49\linewidth}
       \centering
    \includegraphics[width=\linewidth]{norm_embedding.png}
    \caption{Norm ID (MNIST) vs.\ Norm OoD (CIFAR10).}
    \label{subfig:toy_norm_visu}
    \end{subfigure}
    \hfill
    \begin{subfigure}[t]{0.49\linewidth}
        \centering
        \includegraphics[width=\linewidth]{toy_auroc_Crop.pdf}
        \caption{Variance ID (MNIST) vs. Variance OoD (CIFAR10).}
        \label{subfig:toy_var_visu}
    \end{subfigure}
\caption{Separation of ID/OoD embeddings on toy data. A large norm separation leads to a large variance separation.}
    \label{fig:toy_separation}
\end{figure}
\newpage

\section{Additional experiments and details}


\subsection{Stochasticity on intermediate layers}
We introduced stochasticity into intermediate layers using Dropout (following~\citet{kim2023use}). We evaluated the performance by comparing AUROC scores, with the injected stochasticity propagating through to the feature extractor's output during inference. The results shown in Table~\ref{tab:resnet18-drop} are not surprising, as intermediate layers typically capture lower-level features that are less discriminative for distinguishing between ID and OoD data.
\begin{table}[H]
    \centering
    \caption{
        \textbf{ResNet18: CIFAR10 (ID) vs.\ CIFAR100 (OoD)}. AUROC when Dropout is applied at different layers.}
    \label{tab:resnet18-drop}
    \begin{tabular}{lc}
        \toprule
        \textbf{Modified layer} & \textbf{AUROC (\%)} \\ 
        \midrule
        layer1 & 52.3 \\
        layer2 & 53.4 \\
        layer3 & 63.0 \\
        layer4 & 71.1 \\
        Embedding & \textbf{80.1} \\
        \bottomrule
    \end{tabular}
\end{table}


\subsection{DropConnect Parameter Sensitivity and Computational Cost}
\label{app:dropconnect}

In all of our experiments, we use the same DropConnect rate $p$ during both training and inference to ensure consistency of the trainable parameters in the stochastic layer. The main paper does not include a detailed empirical study of how $p$ affects in-distribution (ID) accuracy, so we present such results here for completeness.

\paragraph{ID accuracy vs.\ DropConnect rate}
\begin{table}[H]
  \centering\scriptsize
  \caption{CIFAR-10, CIFAR-100, and ImageNet ID accuracy (\%) as a function of DropConnect rate $p$.}
  \label{tab:app_drop_rate_accuracy}
  \begin{tabular}{c c c c}
    \toprule
    \textbf{DropConnect rate $p$} & \textbf{CIFAR-10 ID Acc.} & \textbf{CIFAR-100 ID Acc.} & \textbf{ImageNet ID Acc.}  \\
    \midrule
    0.1 & 91.83 & 71.21 & 76.11 \\
    0.2 & 91.53 & 72.44 & 75.99 \\
    0.3 & 91.76 & 71.98 & 75.52 \\
    0.4 & 90.96 & 71.90 & 75.40\\
    0.5 & 91.28 & 71.12 & 74.83\\
    0.6 & 91.26 & 72.35 & 75.76\\
    0.7 & 91.32 & 70.56 & 75.33\\
    0.8 & 91.09 & 71.04 & 75.11\\
    0.9 & 91.03 & 70.12 & 75.43\\
    \bottomrule
  \end{tabular}
\end{table}

In the main paper we select $p=0.9$ to enhance OoD separation on the hypersphere (see Fig.~\ref{fig:p_expect}), noting that slightly lower values of $p$ can yield marginal gains in ID accuracy but at the cost of degraded OoD performance.

\paragraph{Training time overhead}
Higher DropConnect rates generally slow convergence during training. Table~\ref{tab:app_drop_rate_time} reports the relative increase in wall-clock training time needed to reach the same validation loss, with $p = 0.1$ as the reference:
\begin{table}[H]
  \centering\scriptsize
  \caption{Relative training time increase (\%) vs.\ DropConnect rate $p$.}
  \label{tab:app_drop_rate_time}
  \begin{tabular}{c c}
    \toprule
    \textbf{DropConnect rate $p$} & \textbf{Training time increase (\%)} \\
    \midrule
    0.1 &  0.0  \\
    0.2 &  6.5  \\
    0.3 & 12.7  \\
    0.4 &  6.5  \\
    0.5 & 12.3  \\
    0.6 & 19.1  \\
    0.7 & 26.5  \\
    0.8 & 28.4  \\
    0.9 & 36.2  \\
    \bottomrule
  \end{tabular}
\end{table}

\paragraph{Inference cost vs.\ number of passes}
Unlike standard MC techniques, our multiple stochastic passes can start from the embedding layer, so the backbone is evaluated only once, which reduces cost. Table~\ref{tab:app_inference_time} shows average batch times (128 images) for $M$ full forward passes vs.\ our optimized partial-forward strategy:
\begin{table}[H]
  \centering\scriptsize
  \caption{Average inference time per batch for $M$ passes (ResNet-50, 128 images).}
  \label{tab:app_inference_time}
  \begin{tabular}{c c c c}
    \toprule
    \textbf{$M$} & \textbf{Full pass (s)} & \textbf{Optimized (s)} & \textbf{Speedup} \\
    \midrule
    1   & 0.0244 & 0.0241 & 1.01$\times$ \\
    5   & 0.1203 & 0.0401 & 3.00$\times$ \\
    10  & 0.2404 & 0.0600 & 4.01$\times$ \\
    15  & 0.3672 & 0.0804 & 4.57$\times$ \\
    20  & 0.4829 & 0.0994 & 4.86$\times$ \\
    30  & 0.7279 & 0.1386 & 5.25$\times$ \\
    40  & 0.9715 & 0.1781 & 5.45$\times$ \\
    50  & 1.2144 & 0.2176 & 5.58$\times$ \\
    \bottomrule
  \end{tabular}
\end{table}
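
The trend in Table~\ref{tab:app_inference_time} can be summarized by a simple linear cost model. As a rough sketch, assume that the full strategy costs $t_{\text{full}}$ per pass, while the optimized strategy pays a one-off backbone cost $t_{\text{backbone}}$ plus $t_{\text{DC}}$ per stochastic pass. These symbols and the two-point fit below (using the $M=10$ and $M=50$ rows) are introduced here for illustration only and are not part of the measurement protocol.
\begin{align*}
T_{\text{full}}(M) &\approx M\, t_{\text{full}}, &
T_{\text{opt}}(M) &\approx t_{\text{backbone}} + M\, t_{\text{DC}}, \\
t_{\text{DC}} &\approx \frac{0.2176 - 0.0600}{50 - 10} \approx 0.0039~\text{s}, &
t_{\text{backbone}} &\approx 0.0600 - 10\, t_{\text{DC}} \approx 0.021~\text{s}, \\
t_{\text{full}} &\approx \frac{1.2144}{50} \approx 0.0243~\text{s}, &
\lim_{M \to \infty} \frac{T_{\text{full}}(M)}{T_{\text{opt}}(M)} &= \frac{t_{\text{full}}}{t_{\text{DC}}} \approx 6.2.
\end{align*}
Under this approximation the speedup saturates at roughly $6\times$, which is consistent with the measured $5.58\times$ at $M=50$.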
\paragraph{Variance concentration vs.\ number of passes}
Finally, we report in Table~\ref{tab:app_variance_convergence} how quickly the empirical variance of an ID sample  converges to the reference value as $M$ increases (averaged over 500 samples):
\begin{table}[H]
  \centering\scriptsize
  \caption{Convergence of average empirical variance vs.\ $M$ (500 samples).}
  \label{tab:app_variance_convergence}
  \begin{tabular}{c c c}
    \toprule
    \textbf{$M$} & \textbf{Avg.\ variance on CIFAR10} & \textbf{Avg.\ variance on ImageNet} \\
    \midrule
    10 & 0.2323 & 0.3372 \\
    15 & 0.2308 & 0.3303 \\
    20 & 0.2349 & 0.3217 \\
    25 & 0.2321 & 0.3144 \\
    30 & 0.2201 & 0.3113 \\
    35 & 0.2118 & 0.3098 \\
    40 & 0.2116 & 0.3107 \\
    45 & 0.2113 & 0.3104 \\
    50 & 0.2121 & 0.3110 \\
    \bottomrule
  \end{tabular}
\end{table}
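
As a quick reading of Table~\ref{tab:app_variance_convergence}, the relative change of the average variance with respect to its value at $M=50$ can be computed directly from the reported entries. Taking the $M=50$ value as the reference is an assumption made here for illustration, since it is simply the largest number of passes we report.
\begin{align*}
\text{CIFAR10:}\quad & \frac{|0.2323 - 0.2121|}{0.2121} \approx 9.5\% \ (M=10), &
& \frac{|0.2116 - 0.2121|}{0.2121} \approx 0.24\% \ (M=40), \\
\text{ImageNet:}\quad & \frac{|0.3372 - 0.3110|}{0.3110} \approx 8.4\% \ (M=10), &
& \frac{|0.3107 - 0.3110|}{0.3110} \approx 0.10\% \ (M=40).
\end{align*}
On both datasets the estimate changes by well under $1\%$ between $M=40$ and $M=50$.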



\subsection{Von Mises--Fisher concentration on the unit hypersphere}
To further validate that ID data exhibit higher concentration on the unit hypersphere during inference, we characterized this concentration using a Von Mises--Fisher distribution. We trained our model with a DropConnect rate of $p = 0.5$
and then applied MC DropConnect during inference as described in Algorithm~\ref{alg:alg_test}.
The density of the Von Mises--Fisher distribution is defined as follows:
\begin{equation}
    f_D(\mathbf{x}) := C_D(\kappa)\exp\left(\kappa\, \psi^\top \mathbf{x}\right), \quad \forall \mathbf{x} \in S^{D-1},
\end{equation}
where $\|\psi\|=1$, $\kappa \geq 0$, and the normalization constant $C_D(\kappa)$ is given by
\begin{equation}
   C_D(\kappa) := \frac{\kappa^{D/2-1}}{(2\pi)^{D/2}I_{D/2-1}(\kappa)},
\end{equation}
where $I_v$ denotes the modified Bessel function of the first kind of order $v$. The greater the value of $\kappa$, the more concentrated the distribution is around $\psi$.
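
For intuition, two standard facts about this family may be useful. They are textbook results rather than results specific to our method. First, in dimension $D=3$ the normalization constant has a closed form. Second, for general $D$, the concentration can be estimated from $M$ unit vectors through their mean resultant length $\bar{R}$, using the closed-form approximation of Banerjee et al.\ (2005). We state this estimator only as one possible choice, and the symbols $\bar{R}$ and $\hat{\kappa}$ are introduced here for illustration.
\begin{align*}
C_3(\kappa) &= \frac{\kappa}{4\pi \sinh \kappa}, \\
\bar{R} &= \Big\| \frac{1}{M} \sum_{m=1}^{M} \boldsymbol{\alpha}_{DC}^{(m)} \Big\|, \qquad
\hat{\kappa} \approx \frac{\bar{R}\,\big(D - \bar{R}^2\big)}{1 - \bar{R}^2}.
\end{align*}
When the $M$ perturbed directions are tightly clustered, $\bar{R}$ approaches $1$ and $\hat{\kappa}$ becomes large, whereas widely spread directions give a small $\bar{R}$ and hence a small $\hat{\kappa}$.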

We observe in Fig.~\ref{fig:vmf} that ID data cluster more tightly than OoD data on the unit hypersphere, although the separation is not optimal because the DropConnect rate ($p = 0.5$) is insufficient. Consequently, the concentration parameter $\kappa$ may serve as a valuable metric for further analysis.
\begin{figure}[H]
    \centering
    \includegraphics[width=0.6\linewidth]{bmf.png}
    \caption{Concentration $\kappa$ used as the OoD score.}
    \label{fig:vmf}
\end{figure}

\subsection{Fine-tuning details}
\label{sec:train_detail}
\begin{itemize}
    \item Toy dataset (MNIST): we used a multi-layer perceptron with a DropConnect layer as its third layer. The architecture is 784-256-256 (DC)-128-10 with ReLU activations. We trained for 40 epochs using SGD with a learning rate of 0.01 and a weight decay of $1 \times 10^{-3}$.
    \item CIFAR10/CIFAR100: we fine-tuned a vanilla ResNet18 model (pretrained in PyTorch) with the first convolutional layer modified to use a $3 \times 3$ kernel. We trained for 200 epochs using SGD with a batch size of 128 and a momentum of 0.9, with a DropConnect rate of $0.9$ on the DC layer, which has the same structure as the penultimate layer and is fully connected to it. We used an initial learning rate of 0.1 with a cosine annealing scheduler and applied standard data augmentation techniques: cropping and horizontal flipping.
    \item ImageNet: we fine-tuned a vanilla ResNet50 model (pretrained in PyTorch). We trained for 150 epochs using SGD with a batch size of 128 and a momentum of 0.9, with a DropConnect rate of $0.9$ on the DC layer, which has the same structure as the penultimate layer and is fully connected to it. We used an initial learning rate of 0.1 with a cosine annealing scheduler and applied standard data augmentation techniques: cropping and horizontal flipping.

    \item ViT visualization: since the model was pretrained on ImageNet, we had to fine-tune all layers to achieve high accuracy. We trained for $30$ epochs using SGD with a batch size of 64, a weight decay of $5 \times 10^{-4}$, and a momentum of $0.9$, and applied Dropout with a probability of $0.5$ on the penultimate layer. We used an initial learning rate of 0.01 and applied standard data augmentation techniques: cropping and horizontal flipping.
\end{itemize}



\input{proof.tex}

\input{concentration_inequality}

\input{computation}

%\input{concentration.tex}

\end{document}
