\section{More on Related Work} 
\label{app:summary}

Self-supervised representation learning has become standard in many computer vision applications.
Instead of training a neural network that maps raw data directly to a target value (e.g., a class label), this paradigm optimizes a neural network $h_\theta$ that maps an input $\bx$ to a latent vector $\bz \in \mathbb{R}^{\rdim}$ in the $\rdim$-dimensional representation space.

Uncertainty estimation is the process of quantifying how uncertain or reliable the learned representations of the data are.
Assessing the uncertainty of a neural network's representation is a key step toward a reliable machine learning framework, because the uncertainty provides information about the data and about how confident the model is in its modeling.
In supervised learning settings, where the ground-truth output (e.g., label) is given, there are several ways to estimate uncertainty in deep learning, such as Bayesian approaches and ensembling.
Estimating uncertainty in deep representation learning, on the other hand, is still a relatively underexplored area of research.

\subsection{Uncertainty-aware Representation Learning}
The representation model is often treated as deterministic in recently popular frameworks \citep{chen2020simple, he2020momentum, chen2020improved}.
To address the reliability issue of those frameworks, some recent works \citep{oh2018modeling, wu2020simple} extend the deterministic frameworks to stochastic ones, allowing for the construction of an uncertainty-aware self-supervised representation learning framework.

\citet{oh2018modeling} introduces hedged instance embedding (HIB), which optimizes a representation network that approximates the distribution over the representation vector $p_\theta(\bz|\bx)$ under the (soft) contrastive loss \citep{hadsell2006dimensionality} and the variational information bottleneck (VIB) principle \citep{alemi2016deep, achille2017emergence}.
More specifically, the HIB encoder parameterizes the distribution as a mixture of $C$ Gaussians: $p_\theta(\bz|\bx) = \sum_{c=1}^{C} \mathcal{N}(\bz; \mu_\theta(\bx,c), \Sigma_\theta(\bx,c))$.
Based on the stochastic embedding, the paper proposes an uncertainty metric called the \emph{self-mismatch} probability:
\begin{equation}
    s_{self\_mismatch}(\bx^*) \defined 1 - p(m | \bx^*, \bx^*)
\end{equation}
where $p(m | \bx_1, \bx_2) \approx \int p(m | \bz_1, \bz_2) \, p_\theta(\bz_1 | \bx_1) \, p_\theta(\bz_2 | \bx_2) \, d\bz_1 \, d\bz_2$ and $p(m|\bz_1, \bz_2) \defined \sigma(-a \| \bz_1 - \bz_2\|_2 + b)$, following their contrastive learning method.
The self-mismatch probability can be interpreted as reflecting the expected distance between two points sampled independently from the output distribution of the same input.
In other words, this uncertainty metric is based on the idea that an input with a large aleatoric uncertainty will span a wider region of the representation space, resulting in a smaller $p(m | \bx, \bx)$.
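In practice, the integral above is typically approximated by sampling.
A minimal sketch, assuming $K$ samples are drawn independently from each embedding distribution (the sample count $K$ and the sample notation $\bz_i^{(k)}$ are introduced here for illustration only), is
\begin{equation}
    p(m | \bx_1, \bx_2) \approx \frac{1}{K^2} \sum_{k_1=1}^{K} \sum_{k_2=1}^{K} \sigma\big(-a \| \bz_1^{(k_1)} - \bz_2^{(k_2)} \|_2 + b\big), \qquad \bz_i^{(k)} \sim p_\theta(\bz_i | \bx_i).
\end{equation}
Setting $\bx_1 = \bx_2 = \bx^*$ and subtracting the result from one gives an estimate of the self-mismatch score.
If $a > 0$ (an assumption not stated above), samples that are spread more widely produce larger pairwise distances, smaller sigmoid values, and hence a larger score.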

\citet{wu2020simple} propose a similar extension.
The paper introduces a distribution encoder that outputs a Gaussian distribution over representations with a diagonal covariance matrix $\Sigma_\theta(\bx)$ and extends the normalized temperature-scaled cross-entropy loss (NT-Xent) \citep{chen2020simple} to a distribution-level contrastive objective.
The norm of the covariance matrix produced by the distribution encoder is used to assess the reliability of a given input:
\begin{equation}
    s_{var}(\bx^*) \defined \| \Sigma_\theta (\bx^*) \|
\end{equation}
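The choice of norm is not specified here.
As an illustration only, assume the Frobenius norm and write the diagonal covariance as $\Sigma_\theta(\bx^*) = \mathrm{diag}(v_1, \dots, v_{\rdim})$, where the symbols $v_j$ are introduced for this example. Then
\begin{equation}
    \| \Sigma_\theta(\bx^*) \|_F = \Big( \sum_{j=1}^{\rdim} v_j^2 \Big)^{1/2}.
\end{equation}
For instance, with $\rdim = 2$ and $(v_1, v_2) = (0.3, 0.4)$, the score is $\sqrt{0.09 + 0.16} = 0.5$, and doubling every variance doubles the score.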

Despite the benefits of stochastic representations, these approaches have some shortcomings.
One limitation is that they require re-training.
Recent large models are trained on ever larger datasets and continue to grow in size, so training them demands more time and computing power.
As a result, it may not always be practical or feasible for users to re-train a model.
Additionally, new training schemes can impose unexpected inductive biases on algorithms that are already working well.
For example, most probability-based methods assume standard distributions such as a Gaussian or a mixture of Gaussians.
Such assumptions may reduce the effectiveness of the model or slow down the training procedure.


\subsection{Novelty Detection in Representation Space}
Several studies detect out-of-distribution (OOD) samples by determining the novelty of the data representation from a deterministic model \citep{lee2018simple, van2020uncertainty, tack2020csi, mirzae2022fake}.
Although the specific details of each technique vary, we observe that these methods commonly use the relative distance of the query data point's representation vector to other reference points:
\begin{equation}
    s_{d}(\bx^*) \defined \amk \Big( \big\{ \mathsf{dist} \big( h(\bx^*), h(\bx) \big) \mid \bx \in \bX_{\text{ref}} \big\} \Big)
\end{equation}
where $\amk$ outputs the average of the $k$ smallest distances in the representation space between the query $\bx^*$ and the reference points $\bX_{\text{ref}}$, measured by the distance metric $\mathsf{dist}$.
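As a small worked illustration of $\amk$ (the numbers and the order-statistic notation $d_{(j)}$ are made up for exposition), let $d_{(1)} \le d_{(2)} \le \dots \le d_{(n)}$ denote the sorted distances from the query to the $n$ reference points. Then
\begin{equation}
    \amk\big(\{d_1, \dots, d_n\}\big) = \frac{1}{k} \sum_{j=1}^{k} d_{(j)}.
\end{equation}
For example, with $k = 2$ and distances $\{0.9, 0.2, 0.5, 0.4\}$, the two smallest values are $0.2$ and $0.4$, so $s_d(\bx^*) = (0.2 + 0.4)/2 = 0.3$.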

As shown in the table, some works are designed for supervised learning schemes that require instance-specific training labels.
\citet{lee2018simple} and \citet{van2020uncertainty} construct reference points as empirical class means:
\begin{equation}
\Big\{ \boldsymbol{\hat{\mu}}_c = \frac{1}{N_c}\sum_{\bx_i \in \bX_{\text{ref}}}h_\theta(\bx_i) \mathbbm{1}(y_i=c) \Big\}_{c=1}^C
\end{equation}
where $\boldsymbol{\hat{\mu}}_c$ is the centroid of the training representations whose label $y_i$ equals $c$, $N_c$ is the number of training instances with label $c$, and $C$ is the number of classes.
\citet{lee2018simple} then defines the uncertainty score as the minimum Mahalanobis distance to the centroids using a tied empirical covariance matrix, whereas \citet{van2020uncertainty} computes the distance using a radial basis function (RBF) kernel.
These approaches can also be viewed as estimating the probability density of the representation under each class.
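For concreteness, the Mahalanobis-based score can be sketched as follows.
This is an illustrative form rather than a restatement of the exact implementation in \citet{lee2018simple}: the symbols $s_{maha}$, $\hat{\Sigma}$, and $N$ are introduced here for exposition, and the squared distance is used for convenience.
\begin{equation}
    s_{maha}(\bx^*) \defined \min_{c \in \{1, \dots, C\}} \big( h_\theta(\bx^*) - \boldsymbol{\hat{\mu}}_c \big)^\top \hat{\Sigma}^{-1} \big( h_\theta(\bx^*) - \boldsymbol{\hat{\mu}}_c \big)
\end{equation}
with the tied covariance estimated as
\begin{equation}
    \hat{\Sigma} = \frac{1}{N} \sum_{c=1}^{C} \sum_{i:\, y_i = c} \big( h_\theta(\bx_i) - \boldsymbol{\hat{\mu}}_c \big) \big( h_\theta(\bx_i) - \boldsymbol{\hat{\mu}}_c \big)^\top, \qquad N = \sum_{c=1}^{C} N_c.
\end{equation}
If each class is modeled as a Gaussian $\mathcal{N}(\boldsymbol{\hat{\mu}}_c, \hat{\Sigma})$ with this shared covariance, the class-conditional log-density equals $-\tfrac{1}{2}$ times the quadratic form above plus a constant that does not depend on $c$, so the smallest distance corresponds to the highest class-conditional density.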

Nevertheless, the aforementioned methods require training labels, which are not always available.
In addition, evaluating uncertainty based on the training labels does not guarantee the method's efficacy, as downstream tasks frequently use different labeling schemes.
To estimate uncertainty in the absence of class label information, \citet{tack2020csi} measures the minimum distance between the query instance and all training instances in the representation space.
\citet{tack2020csi} additionally suggest ensembling the uncertainty score over a set of transformations (i.e., augmentations) $\mathcal{T}$: $s_{d\mbox{-}ens}(\bx^*) = \frac{1}{|\mathcal{T}|} \sum_{t \in \mathcal{T}} s_d(t(\bx^*))$.
Meanwhile, \citet{mirzae2022fake} averages the distance to the $k$ nearest instances rather than taking only the closest one to improve effectiveness.
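In terms of the generic score $s_d$ defined earlier, these two label-free scores can be read as special cases.
As a sketch, assuming that $\bX_{\text{ref}}$ is the set of training instances and that the same $\mathsf{dist}$ is used, choosing $k = 1$ gives
\begin{equation}
    s_d(\bx^*) \big|_{k=1} = \min_{\bx \in \bX_{\text{ref}}} \mathsf{dist}\big( h(\bx^*), h(\bx) \big),
\end{equation}
which matches the minimum-distance score, while $k > 1$ corresponds to the $k$-nearest-neighbor average.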

The main limitation of the above two approaches is the lack of theoretical justification for the proposed metrics, as they are founded upon heuristic rules.
For example, as demonstrated by our empirical studies in Section~\ref{app:abbl}, a poor choice of the distance metric can lead to contradictory results.
In addition, as depicted in Figure~\ref{fig:abl_k_full}, using a larger $k$ in those schemes does not necessarily improve their effectiveness.
Considering that the primary goal of these studies is to deploy foundation models in safety-critical settings, establishing a robust reliability measure and analyzing its theoretical validity is vital.
