
\clearpage
\section{Further Discussion on Neighborhood Consistency} 
\label{app:alternative}

In this section, we present an alternative viewpoint to clarify the underlying motivation behind neighborhood consistency and demonstrate how it addresses the shortcomings of existing supervised learning frameworks.
As illustrated in Section~\ref{sec:bg}, the uncertainty in supervised learning can be expressed by the variance of predictions across different functions.
This approach is applicable because the different functions map the input into the ``same'' output space, in which distance and similarity are well-defined.

From the perspective of unsupervised learning, however, there is no ``unique'' ground-truth representation space, which makes it difficult to compare the representations constructed by different embedding functions.
Therefore, to properly compare the two distinct abstract representations $h_i(\bx^*)$ and $h_j(\bx^*)$, we need a transformation function $\phi(\cdot)$ that maps different representations into a comparable form.
Then, following Equation~\eqref{eq:unc_sup}, the consistency across the set of representations $\cH = \{h_1(\bx^*), \cdots, h_M(\bx^*) \}$ can be assessed as follows:
\begin{equation} \label{eq:repr_cs}
    \mathsf{Unc}(\bx^* ; \cH) = \frac{1}{M^2} \sum_{i < j} \Sim{\phi(\bz^*_i), \phi(\bz^*_j)}
\end{equation}
where $\phi(\bz^*_i)$ can be viewed as a \emph{surrogate representation} of $\bz^*_i = h_i(\bx^*)$.

What characteristics should an appropriate $\phi(\cdot)$ have?
First and foremost, it requires an anchor that connects the different representation spaces, and this paper suggests using (reliable) ``pre-training data'' as such anchors.
Accordingly, we first introduce a surrogate vector representation consisting of the relative distances from the test point to each reference point:
\begin{equation}
    \phi_\text{rel-fc}(\bz_i^* ; \bX_{\text{ref}}) \defined \Big[ \mathsf{dist} \big( \bz_i^*, \bz_i^{(1)} \big), \cdots, \mathsf{dist} \big( \bz_i^*, \bz_i^{(n)} \big) \Big]^T
\end{equation}
where $\bz_i^{(l)} = h_i(\bx^{(l)})$ and $\mathsf{dist}(\cdot, \cdot)$ is a distance metric in the representation space (e.g., cosine distance).
This surrogate representation captures the relational information to the reference data points rather than the absolute location of the test representation.
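
For intuition, consider an idealized case (a simplifying assumption made only for illustration, not a property we claim of trained models) in which two embedding functions differ by an orthogonal transformation, i.e., $h_j(\bx) = R \, h_i(\bx)$ with $R^T R = I$.
With cosine distance, for any vectors $u$ and $v$ in the representation space,
\begin{equation*}
    \mathsf{dist}(R u, R v) = 1 - \frac{(R u)^T (R v)}{\lVert R u \rVert \, \lVert R v \rVert} = 1 - \frac{u^T v}{\lVert u \rVert \, \lVert v \rVert} = \mathsf{dist}(u, v) ,
\end{equation*}
so that $\phi_\text{rel-fc}(\bz_i^* ; \bX_{\text{ref}}) = \phi_\text{rel-fc}(\bz_j^* ; \bX_{\text{ref}})$ even though $\bz_i^*$ and $\bz_j^*$ generally differ as absolute coordinates.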


The aforementioned surrogate, however, retains the distances to all $n$ reference points, including distant ones, and can therefore be burdened with an excessive amount of information.
To address this issue, we may consider a \emph{sparsified representation} as follows:
\begin{equation} \label{eq:sparse}
    \phi_\text{rel-sparse}(\bz_i^* ; \bX_{\text{ref}}) \defined \Big[ \mathbbm{1} \big( \mathsf{dist} \big( \bz_i^*, \bz_i^{(1)} \big) \le \epsilon \big) , \cdots, \mathbbm{1} \big( \mathsf{dist} \big( \bz_i^*, \bz_i^{(n)} \big) \le \epsilon \big) \Big]^T
\end{equation}
where $\mathbbm{1}(\cdot)$ is an indicator function and $\epsilon \in \Reals$ is a small real number.
The equivalent set notation of the vector representation is: $\epsilon\text{-NN}_i(\bx^*) \defined \{ l \mid \mathsf{dist}\big( h_i(\bx^*), ~ h_i(\bx^{(l)}) \big) \le \epsilon \}$.
The corresponding neighborhood consistency measure can be computed as follows:
\begin{equation}
    \mathsf{NC}_{\epsilon}(\bx^*) \defined \frac{1}{M^2} \sum_{i < j} \Sim{\epsilon\text{-NN}_i \big( \bx^* \big), ~ \epsilon\text{-NN}_j \big( \bx^*\big)} .
\end{equation}
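As an illustrative example with hypothetical numbers (chosen for exposition only), let $n = 5$ and $\epsilon = 0.3$, and suppose that two embedding functions $h_1$ and $h_2$ yield the distances $(0.1, 0.7, 0.2, 0.9, 0.4)$ and $(0.2, 0.1, 0.5, 0.8, 0.6)$, respectively, from the test point to the five reference points.
Then
\begin{align*}
    \phi_\text{rel-sparse}(\bz_1^* ; \bX_{\text{ref}}) &= [1, 0, 1, 0, 0]^T , & \epsilon\text{-NN}_1(\bx^*) &= \{1, 3\} , \\
    \phi_\text{rel-sparse}(\bz_2^* ; \bX_{\text{ref}}) &= [1, 1, 0, 0, 0]^T , & \epsilon\text{-NN}_2(\bx^*) &= \{1, 2\} ,
\end{align*}
so the two neighbor sets share one index out of three distinct indices, and the term for the pair $(1, 2)$ in $\mathsf{NC}_{\epsilon}(\bx^*)$ is the similarity between $\{1, 3\}$ and $\{1, 2\}$.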

However, selecting an appropriate value for $\epsilon$ can be challenging since determining the proper scale of the representation space is not straightforward. As a result, we introduce a scale-free version of neighborhood consistency that relies on the local $k$-nearest neighborhood:
\begin{equation}
    \mathsf{NC}_{k}(\bx^*) \defined \frac{1}{M^2} \sum_{i < j} \Sim{k\text{-NN}_i \big( \bx^* \big), ~ k\text{-NN}_j \big( \bx^*\big) }
\end{equation}
where $k\text{-NN}_i(\bx^*)$ is the index set of $k$ nearest neighbors of $h_i(\bx^*)$ among $\bZ_{i, \text{ref}}$.
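
To make the normalization concrete, note that the sum runs over the $M(M-1)/2$ unordered pairs $i < j$.
Assuming that $\mathsf{Sim}$ takes values in $[0, 1]$, it follows that
\begin{equation*}
    0 \le \mathsf{NC}_{k}(\bx^*) \le \frac{1}{M^2} \cdot \frac{M(M-1)}{2} = \frac{M-1}{2M} .
\end{equation*}
For instance, with the hypothetical values $M = 3$ and pairwise similarities $0.5$, $0.25$, and $1.0$, we obtain $\mathsf{NC}_{k}(\bx^*) = (0.5 + 0.25 + 1.0) / 9 \approx 0.194$, while the upper bound is $1/3$.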

For the set similarity metric $\mathsf{Sim}$, we can use either the \emph{Jaccard Similarity} or the \emph{Overlap Coefficient}:
\begin{align}
    \text{Jaccard Similarity}(S_1, S_2) &\defined \frac{\lvert S_1 \cap S_2 \rvert}{\lvert S_1 \cup S_2 \rvert} \\
    \text{Overlap Coefficient}(S_1, S_2) &\defined \frac{\lvert S_1 \cap S_2 \rvert}{\min (\lvert S_1\rvert, \lvert S_2\rvert)} .
\end{align}
If the sizes of $S_1$ and $S_2$ are both equal to $k$, then $\lvert S_1 \cup S_2 \rvert = 2k - \lvert S_1 \cap S_2 \rvert$ and $\min (\lvert S_1\rvert, \lvert S_2\rvert) = k$.
Thus, regardless of which one is selected, both similarity metrics depend solely on $\lvert S_1 \cap S_2 \rvert$ when the $k$-NN approach is used to determine the neighbors.
Additionally, given a set of test points, both metrics provide the same order (i.e., rank) in terms of their representation reliability.
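
Concretely, writing $c = \lvert S_1 \cap S_2 \rvert$ with $\lvert S_1 \rvert = \lvert S_2 \rvert = k$, and using the shorthand $J$ and $O$ (introduced here only for brevity) for the Jaccard Similarity and the Overlap Coefficient of a given pair of neighbor sets, the two definitions reduce to
\begin{equation*}
    J = \frac{c}{2k - c} , \qquad O = \frac{c}{k} , \qquad \text{and hence} \qquad J = \frac{O}{2 - O} .
\end{equation*}
Since $O \mapsto O / (2 - O)$ is strictly increasing on $[0, 1]$, each metric is a monotone function of the other for any single pair of neighbor sets.
For example, with the hypothetical value $k = 10$, an intersection of size $c = 5$ gives $O = 0.5$ and $J = 5/15 = 1/3$, whereas $c = 8$ gives $O = 0.8$ and $J = 8/12 = 2/3$.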













