

\input{tab_comparison}
\section{Numerical Experiments}
\label{sec::exp}


\subsection{Experiment Setup} 
\label{sec:setup}








\paragraph{Embedding Function.} We use SimCLR \citep{chen2020simple}, Bootstrap Your Own Latent (BYOL) \citep{grill2020bootstrap}, and Momentum Contrast (MoCo) \citep{he2020momentum} to pre-train ResNet-18 and ResNet-50 models \citep{he2016deep} as the embedding functions, using a single NVIDIA V100 GPU. During the pre-training stage, we do not use any class label information. 
To construct the ensemble, we train a total of $M=10$ embedding functions for each pre-training algorithm and dataset with different random seeds, following \citet{lakshminarayanan2017simple}.
The impact of $M$ on our experiments is studied in Section~\ref{sec:ablation}.



\paragraph{Baselines.} We compare our proposed method, $\mathsf{NC}_k$ in Equation~\eqref{defn::NC}, against the following widely recognized baselines:
\begin{itemize}[leftmargin=2em]
    \item $\amk$ \citep{tack2020csi, mirzae2022fake}: the average of the $k$ minimum distances from the test point to the reference data in the representation space. A lower value of $\amk$ indicates higher reliability.
    
    \item $\mathsf{Norm}$ \citep{tack2020csi}: the $L_2$ norm of the representation $\lVert h(\bx^*) \rVert_2$. A higher value of $\mathsf{Norm}$ indicates higher reliability.
    
    \item $\mathsf{LL}$ \citep{ardeshir2022uncertainty}: the log-likelihood of the test point under a Gaussian mixture model (for representations in $\Reals^{d}$) or a von Mises--Fisher mixture model \citep{banerjee2005clustering} (for representations on $\mathcal{S}^{d-1}$) fitted on the reference data. A higher value of $\mathsf{LL}$ indicates higher reliability.
    
    \item Feature Variance ($\mathsf{FV}$): representation consistency measured by $\Varr{i\sim [M]}{h_i(\bx^*)}$, as described in Theorem~\ref{thm::sl_count_exam_inf}.
    \rev{$\mathsf{FV}$ extends a standard UQ measure from supervised learning. Specifically, the variance of neural networks' predicted scores is often used to measure epistemic uncertainty \citep{kendall2017uncertainties, lakshminarayanan2017simple, ritter2018scalable}. Here we apply this measure to latent representation spaces.}
    A lower value of $\mathsf{FV}$ indicates higher reliability.
    
\end{itemize}

All baselines, except $\mathsf{FV}$, are based on a single embedding function. For a fair comparison, we consider the (point-wise) ensemble average of each score over different embedding functions for $\amk$, $\mathsf{Norm}$, and $\mathsf{LL}$.

\paragraph{Hyperparameters.} To reduce the computational cost, we randomly select $n = 5{,}000$ pre-training data points as the reference dataset $\bX_{\text{ref}}$.
We repeat our experiments $5$ times with distinct random seeds for choosing the reference data and report the average evaluation scores. Since the standard deviations of our scores are on average below 1\%, we defer the experimental results with error bars to the appendix.
We choose $k=100$ for $\mathsf{NC}_k$ and $k=1$ for $\amk$ (see Section~\ref{sec:ablation} for an ablation study on the choice of $k$).

For our method and $\amk$, we test both cosine distance and Euclidean distance as the distance metric. Similarly, $\mathsf{LL}$ and $\mathsf{FV}$ are evaluated using both unnormalized representations $h(\bx) \in \Reals^d$ and normalized representations $h(\bx) / \lVert h(\bx) \rVert_2 \in \mathcal{S}^{d-1}$.
\yj{
We discuss the implications of these choices in Section~\ref{sec:ablation}. Due to space limits, the results using normalized representations are reported in the main manuscript, while those using unnormalized representations are detailed in Appendix~\ref{app:abbl}.

}
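
The cosine and Euclidean options are closely related once representations are normalized. As a short worked identity (the symbols $\mathbf{u}$ and $\mathbf{v}$ are introduced here only for illustration, and we assume the common convention that cosine distance is one minus cosine similarity), let $\mathbf{u}, \mathbf{v} \in \mathcal{S}^{d-1}$ be two normalized representations. Then
\[
    \lVert \mathbf{u} - \mathbf{v} \rVert_2^2
    = \lVert \mathbf{u} \rVert_2^2 + \lVert \mathbf{v} \rVert_2^2 - 2\,\mathbf{u}^\top \mathbf{v}
    = 2\left(1 - \mathbf{u}^\top \mathbf{v}\right).
\]
Hence, under this convention, the squared Euclidean distance between normalized representations is twice their cosine distance, and the two metrics induce the same nearest-neighbor ordering on $\mathcal{S}^{d-1}$. The two metrics can differ only when Euclidean distance is applied to unnormalized representations.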








    








\paragraph{Evaluation Protocol.} 
\hao{To evaluate the effectiveness of our method (i.e., $\mathsf{NC}_{100}$) in capturing the representation reliability (Definition~\ref{defn::rep^rel}) and to compare it against baselines\footnote{We use the negative scores of $\amk$ and $\mathsf{FV}$ since their lower values indicate higher reliability.}, we use \kendall to measure the correlation between each method and the representation reliability ($\mathsf{Reli}$).
We compute the downstream performance ($\mathsf{Perf}$) on each test point by using either 1) the negative predictive entropy, capturing downstream uncertainty; or 2) the negative Brier score, reflecting predictive accuracy.
}
The computation details are provided in Appendix~\ref{app:performance}.
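
For reference, we sketch the standard Kendall rank correlation coefficient $\tau$ in its tie-free form. The notation below ($N$ test points, with score $s_i$ and reliability $r_i$ for the $i$-th test point) is introduced here for illustration only, and the variant used in practice may include a correction for ties:
\[
    \tau = \frac{2}{N(N-1)} \sum_{1 \le i < j \le N} \mathrm{sign}(s_i - s_j)\,\mathrm{sign}(r_i - r_j).
\]
Under this form, $\tau = 1$ means that a score orders the test points exactly as $\mathsf{Reli}$ does, $\tau = -1$ means that the order is fully reversed, and values near $0$ indicate no monotone association.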

We focus on downstream classification tasks. For each task, we freeze the pre-trained model and fine-tune the linear heads.
To leverage the multi-class labels and minimize the influence of the downstream training processes, we break down each $C$-class classification task into a set of one-vs-one (OVO) binary classification tasks.
As a result, the total number of downstream tasks is $|\cT| = C (C - 1) / 2$, with each data point being evaluated in $(C - 1)$ tasks.
Finally, we average the performance across all OVO tasks to compute the representation reliability.
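
As a worked instance of this count, consider a dataset with $C = 10$ classes, such as \texttt{CIFAR-10}. Each unordered pair of classes defines one OVO task, so
\[
    |\cT| = \frac{C(C-1)}{2} = \frac{10 \cdot 9}{2} = 45,
\]
and a data point of a given class appears only in the $C - 1 = 9$ tasks that pair its class with one of the other classes. For $C = 100$ classes, as in \texttt{CIFAR-100}, the same formula gives $|\cT| = 100 \cdot 99 / 2 = 4950$ tasks, with each data point evaluated in $99$ of them.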




\subsection{Main Results}


\subsubsection{Correlation to ID Downstream Performance}

\yj{Recall that the representation reliability depends on the downstream tasks.
Here we focus on in-distribution (ID) tasks where embedding functions are pre-trained on the \texttt{CIFAR-10} or \texttt{CIFAR-100} datasets \citep{krizhevsky2009learning} without using any label information. Subsequently, they are fine-tuned with a linear head on the same dataset.
}

Table~\ref{tab:comparison} shows that our $\mathsf{NC}_{100}$ has a higher correlation with the representation reliability than the baselines. Notably, regardless of the choice of pre-training configuration or downstream task, our method always has a positive correlation, whereas the baselines occasionally exhibit very low or even negative correlation.
Moreover, $\mathsf{FV}$ exhibits low correlation, which aligns with the theoretical results in Section~\ref{sec:consistency}.

\rev{We offer some insight into why $\mathsf{NC}_k$ performs more favorably than the baselines. Suppose a test point (a fox image) has unreliable representations: it sits close to a reference point A (a dog image) and far from another point B (a cat image) in one representation space, while the opposite holds in another representation space. In this case, $\mathsf{NC}_k$ assigns a low score because of this inconsistency. However, $\amk$, $\mathsf{LL}$, and $\mathsf{Norm}$ would indicate ``reliable'' since they rely on the relative distance of the test point to the closest reference point (or to the origin). $\mathsf{FV}$ may also indicate ``reliable'' if the representations of the test point remain consistent despite falling into clusters of dog or cat images in different representation spaces.
None of these baselines aligns the different representation spaces before computing these relative distances.
}

\input{tab_transfer.tex}
\subsubsection{Correlation to Transfer Learning Performance}

We conduct experiments with out-of-distribution (OOD) tasks, focusing on transfer learning. We use the \texttt{TinyImagenet} dataset \citep{le2015tiny} as the source dataset for pre-training the embedding functions. Subsequently, we fine-tune and evaluate the embedding functions on three target datasets: \texttt{CIFAR-10}, \texttt{CIFAR-100}, and \texttt{STL-10} \citep{coates2011analysis}.
The correlation between each method and the representation reliability is reported in Table~\ref{tab:transfer}.
As observed, $\mathsf{NC}_{100}$ consistently captures the representation reliability across a diverse range of settings.

\rev{We observe that $\mathsf{Norm}$ achieves performance comparable to $\mathsf{NC}_{100}$. One possible explanation is that several factors (e.g., properties of the embedding functions and characteristics of the test points) influence the reliability of a test point's representation. In the context of transfer learning, OOD test points are more likely to be unreliable. $\mathsf{Norm}$ is good at identifying OOD samples \citep{tack2020csi}, which enables it to capture the representation reliability to some extent.
}






\input{tab_select}
\subsection{Use Case: Ranking Pre-trained Models}

In practice, several off-the-shelf pre-trained models are typically available for a downstream task. These models may differ in architecture and in the learning paradigm used to train them, making it challenging to decide which one to use. We apply our $\mathsf{NC}_{100}$ to aid this selection by ranking the models based on their average reliability scores. We extend the previous transfer learning experiments as follows. For each data point, we rank the three pre-trained models (SimCLR, BYOL, and MoCo) using $\mathsf{NC}_{100}$ and each baseline, respectively. We then compute the \kendall score between each of these rankings and the actual ranking (based on average downstream task performance). We present the experimental results in Table~\ref{tab:select}.
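
To illustrate the per-point score, consider the tie-free form of the Kendall rank correlation coefficient $\tau$ (an assumption made here for illustration, with $n_c$ and $n_d$ introduced as the numbers of concordant and discordant model pairs between a predicted ranking and the actual ranking). Three models yield three model pairs, so $n_c + n_d = 3$ and
\[
    \tau = \frac{n_c - n_d}{3} \in \left\{-1,\, -\tfrac{1}{3},\, \tfrac{1}{3},\, 1\right\}.
\]
As a hypothetical example, suppose the actual ranking is SimCLR, BYOL, MoCo (best to worst) and a score ranks the models as BYOL, SimCLR, MoCo. Only the pair (SimCLR, BYOL) is ordered differently, so $n_c = 2$, $n_d = 1$, and $\tau = 1/3$. The per-point score therefore takes only four values.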

\rev{Our $\mathsf{NC}_{100}$ achieves the second-best performance for ResNet-18 and the best performance for ResNet-50. In contrast, the baselines ($\amone$, $\mathsf{Norm}$, and $\mathsf{LL}$) exhibit negative scores when ranking embedding functions. The only exception is $\mathsf{FV}$, which achieves performance comparable to $\mathsf{NC}_{100}$ despite its low (or even negative) correlation in the previous experiments (Tables~\ref{tab:comparison} and \ref{tab:transfer}).
One possible interpretation is that the primary issue with $\mathsf{FV}$ lies in its failure to align different representation spaces before comparing them. When ranking different embedding functions, the degrees of misalignment between their ensembles should roughly be of the same order, as these embedding functions share the same architecture. Consequently, this error term would cancel out when ranking these embedding functions.
}

\subsection{Quantifying the Reliability of Individual Embedding Functions} \label{sec:individual}
\hao{Our $\mathsf{NC}$ requires a set $\mathcal{H}$ of embedding functions, comparing the consistent neighbors of a test point across the representation spaces generated by these functions. When $\mathcal{H}$ has a single function $h$, we can still use our $\mathsf{NC}$ by applying the same learning algorithm (with varying random seeds) and data used to train $h$, yielding additional embedding functions to augment $\mathcal{H}$. The rationale behind this is that these embedding functions will possess similar inductive biases and generate representation spaces with comparable reliability. 
We validate this intuition in Table~\ref{tab:single} and Table~\ref{tab:trasnfer_single} in Appendix~\ref{app:indiv}. The results demonstrate that we can effectively predict the reliability of individual embedding functions using this approach.
}

\subsection{Ablation Studies} \label{sec:ablation}
\paragraph{Robustness to the Choice of Distance Metric.} 
We explore how various distance metrics within the representation space affect both our method and the baseline approaches. 
While Euclidean distance is a natural choice for the representations in $\mathbb{R}^d$, we also investigate cosine distance. This choice is motivated by the widespread use of cosine similarity in self-supervised algorithms, such as SimCLR, within their loss functions.

\hao{
We present the results in Tables~\ref{tab:comparison_unnormalized} and~\ref{tab:transfer_unnormalized} in Appendix~\ref{app:abbl}. Our key observation is that the baselines, including $\amone$ and $\mathsf{LL}$, are sensitive to the choice of distance metric and may even exhibit negative correlations with the representation reliability. In contrast, $\mathsf{NC}_{100}$ consistently demonstrates a positive correlation and ranks within the top $2$ among all methods, regardless of the distance metric chosen.
}


\paragraph{Performance with a Smaller Ensemble Size ($M$).}
\yj{%
We conduct an ablation study to investigate the impact of the ensemble size $M$ on estimating the representation reliability. The results in Figures~\ref{fig:n_ens18} and~\ref{fig:n_ens50} and Appendix~\ref{app:suppl} show that increasing the ensemble size improves the correlation scores. Nonetheless, our method generally outperforms the baseline approaches, even with a small ensemble size of $M = 2$.
Exploring cost-effective ensemble construction methods, such as low-rank approximations \citep{wen2020batchensemble} or stochastic weight averaging \citep{izmailov2018averaging, maddox2019simple}, can be a promising future direction.}

\paragraph{Trade-off on the Number of Neighbors ($k$).}
As discussed in Section~\ref{sec:nb_consistency}, the choice of $k$ in Equation~\eqref{defn::NC} leads to a trade-off between having more consistent neighbors and preserving the overall reliability of those neighbors. To explore this trade-off, we conduct experiments with different values of $k \in \{1, 2, 5, 10, 20, 50, 100, 200, 500, 1000\}$. The correlation between our method and the representation reliability is illustrated in Figures~\ref{fig:k18} and~\ref{fig:k50} and Appendix~\ref{app:suppl}: as expected, it initially increases and then decreases as $k$ grows. We observe that the best performance is achieved with $k$ around 100 (i.e., 2\% of $|\bX_{\text{ref}}| = 5{,}000$) for $\mathsf{NC}_k$ across different pre-training algorithms, model architectures, and downstream data.


\subsection{Key Observations \& Takeaways}
In summary, our main findings from the experiments are:
\begin{itemize}[leftmargin=1.0em]
    \item Our proposed method $\mathsf{NC}_{k}$ effectively captures the representation reliability and can help compare the reliability of different pre-trained models.
    
    \item More importantly, unlike the baselines, $\mathsf{NC}_{k}$ \emph{consistently} exhibits a positive correlation with the representation reliability across all settings. Although the baselines may occasionally surpass $\mathsf{NC}_{k}$, their performance fluctuates significantly across settings and sometimes even becomes negative, introducing a risk when they are used to assess reliability in safety-critical settings.
\end{itemize}

\section{Final Remarks and Limitations}

Self-supervised learning is increasingly used for training general-purpose embedding functions that can be adapted to various downstream tasks. In this paper, we presented a systematic study to evaluate the quality of representations assigned by the embedding functions. We introduced a mathematical definition of representation reliability, demonstrated that existing UQ frameworks in supervised learning cannot be directly applied to estimate uncertainty in representation spaces, derived an estimate for the representation reliability, and validated our estimate through extensive numerical experiments.


There is a crucial need for future research to investigate and ensure the responsibility and trustworthiness of pre-trained self-supervised models. For example, representations should be interpretable and not compromise private information. Moreover, the embedding functions should exhibit robustness against adversarial attacks and incorporate a notion of uncertainty, in addition to their abstract representations. This work takes an initial step towards understanding the uncertainty of the representations. In the case where a downstream model fails to deliver a desirable output for a test point, the representation reliability can provide valuable insight into whether the mistake is due to unreliable representations or downstream heads.


There are several future directions that are worth further exploration. For example, our current method for estimating the representation reliability uses a set of embedding functions to compute neighborhood consistency. It would be interesting to investigate whether our approach can be expanded to avoid the need for training multiple embedding functions. This could potentially be achieved through techniques such as MC dropout or adding random noise to the neural network parameters in order to perturb them slightly. Additionally, while we currently assess the representation reliability through downstream prediction tasks, it would be valuable to investigate the extension of our definition to cover a broader range of downstream tasks. 









\begin{figure}[t] 
    \centering
    \subfigure[]
    {
        \label{fig:n_ens18}
        \includegraphics[width=0.47\columnwidth]{figures/ablation_m_simclr_resnet18_cifar100.png}
    }
    \subfigure[]
    {
        \label{fig:n_ens50}
        \includegraphics[width=0.47\columnwidth]{figures/ablation_m_simclr_resnet50_cifar100.png}
    }
    \subfigure[]
    {   
        \label{fig:k18}
        \includegraphics[width=0.47\columnwidth]{figures/ablation_k_simclr_resnet18_cifar100.png}
    }
    \subfigure[]
    {   
        \label{fig:k50}
        \includegraphics[width=0.47\columnwidth]{figures/ablation_k_simclr_resnet50_cifar100.png}
    }
    \caption{Ablation studies on the ensemble size ($M$) and the number of neighbors ($k$) for $\mathsf{NC}_k$ (ours) and baselines. Brier score is used for the downstream performance metric. The comprehensive results can be found in Appendix~\ref{app:abbl}.}
    \label{fig::abl_km}
\end{figure}
