













\section{Computing Representation Reliability} \label{app:performance}
Recall that the representation reliability is defined by:
\begin{equation}
    \mathsf{Reli}(\bx^*; \cH, \mathcal{T}) \defined \sum\nolimits_{t \in \cT}{\myPerf{t}} ~ / ~ |\cT| \nonumber
\end{equation}
where $\mathcal{T}$ is the collection of downstream tasks and, for each task $t \in \mathcal{T}$, an embedding function $h$ is drawn uniformly at random from $\cH$ and a downstream predictor $g_{h,t}$ is (optimally) trained on top of $h$ for task $t$.
In this section, we give examples of how to compute representation reliability with standard uncertainty or accuracy measures, using an ensemble of embedding functions $\{h_1, \cdots, h_M\}$ and the corresponding downstream predictors $\{g_{1,t}, \cdots, g_{M,t}\}$.

Let $\mathcal{P}_{\mathcal{H}}$ denote the distribution of $h$, i.e., the uniform distribution over $\cH$.
In regression tasks, the negative variance of the predictive distribution can be used as the performance metric:
\begin{gather} 
\myPerf{t} := -\Var{g_{h,t} \circ h(\bx^*)} \nonumber \\
\quad\quad = -\EEE{h \sim \pH}{(g_{h,t} \circ h(\bx^*) - \EEE{h \sim \pH}{g_{h,t} \circ h(\bx^*)})^2}  \nonumber \\
\quad\quad \approx  -\frac{1}{M^2} \sum_{i < j} \Big( g_{i,t} \circ h_i(\mathbf{x}^*) - g_{j,t} \circ h_j(\mathbf{x}^*) \Big)^2 \label{eq:rep_reli_var}
\end{gather}
where $g_{i,t} \equiv g_{h_i,t}$.
For multi-output tasks, one can instead use the trace of the covariance matrix, i.e., the sum of the variances across output dimensions, which aggregates the per-output uncertainties into a single scalar measure.
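
As a sanity check on the ensemble estimate in Equation~\eqref{eq:rep_reli_var}, the following short derivation shows that the pairwise form coincides with the empirical variance. The shorthand $a_i \coloneqq g_{i,t} \circ h_i(\mathbf{x}^*)$ and $\bar{a} \coloneqq \frac{1}{M}\sum_{i=1}^{M} a_i$ is introduced here only for illustration:
\begin{align}
\frac{1}{M}\sum_{i=1}^{M} (a_i - \bar{a})^2
&= \frac{1}{M}\sum_{i=1}^{M} a_i^2 - \bar{a}^2 \nonumber \\
&= \frac{1}{2M^2} \sum_{i=1}^{M}\sum_{j=1}^{M} (a_i - a_j)^2 \nonumber \\
&= \frac{1}{M^2} \sum_{i<j} (a_i - a_j)^2 . \nonumber
\end{align}
The pairwise form is therefore the variance estimate normalized by $M$ rather than by $M-1$. For instance, with the hypothetical values $M = 2$, $a_1 = 1$, and $a_2 = 3$, both sides equal $1$, so the estimated performance is $-1$. For vector-valued outputs $a_i$, applying the same identity to each output dimension and summing gives the trace of the empirical covariance as $\frac{1}{M^2} \sum_{i<j} \lVert a_i - a_j \rVert_2^2$.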

While our theoretical analysis has focused on regression tasks using variance, for classification tasks, metrics like the Brier score or entropy are more natural choices. For a given downstream classification task $t$ with $C$ classes, the ground truth label of $\bx^*$ is represented by a one-hot vector: $y_{t}^* = ( y_{t[1]}^*, \cdots, y_{t[C]}^*)^T$ where $y_{t[c]}^* \in \{0, 1\}$.
The softmax output for class $c$ of the predictive function $g_{i,t} \circ h_i (\cdot)$ is denoted by $g_{i,t} \circ h_i (\cdot)_{[c]}$. %

The negative Brier score for this setup is calculated as follows:
\begin{gather}
\myPerf{t} := -\mathsf{Brier} (\EEE{h \sim \mathcal{P}_\mathcal{H}}{g_{h,t} \circ h(\bx^*)} ; y_{t}^*) \nonumber \\
\quad\quad  = -\sum_{c=1}^{C} \Big( \EEE{h \sim \mathcal{P}_\mathcal{H}}{g_{h,t} \circ h(\mathbf{x}^*)}_{[c]} - y_{t[c]}^* \Big)^2 \nonumber \\
\quad\quad  \approx -\sum_{c=1}^{C} \Big( \frac{1}{M} \sum_{i=1}^{M} g_{i,t} \circ h_i(\mathbf{x}^*)_{[c]} - y_{t[c]}^* \Big)^2
\end{gather}
and the negative entropy is given by:
\begin{gather}
\myPerf{t} := -\mathsf{Entropy} ( \EEE{h \sim \mathcal{P}_\mathcal{H}}{g_{h,t} \circ h(\mathbf{x}^*)} ) \nonumber \\
\quad\quad  \approx \sum_{c=1}^{C} \Big(\frac{1}{M} \sum_{i=1}^{M} g_{i,t} \circ h_i(\mathbf{x}^*)_{[c]} \Big) \log \Big(\frac{1}{M} \sum_{i=1}^{M} g_{i,t} \circ h_i(\mathbf{x}^*)_{[c]} \Big) .
\end{gather}
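
As a small numerical illustration of the two classification metrics, consider a hypothetical ensemble with $M = 2$ members on a task with $C = 2$ classes. The numbers below are chosen only for ease of computation, and the natural logarithm is assumed. Suppose the two softmax outputs are $(0.8, 0.2)^T$ and $(0.6, 0.4)^T$, and the ground truth is $y_{t}^* = (1, 0)^T$. The averaged prediction is then $(0.7, 0.3)^T$, which gives
\begin{align}
-\mathsf{Brier} &\approx -\big( (0.7 - 1)^2 + (0.3 - 0)^2 \big) = -0.18 , \nonumber \\
-\mathsf{Entropy} &\approx 0.7 \log 0.7 + 0.3 \log 0.3 \approx -0.611 . \nonumber
\end{align}
Both values are nonpositive. The negative Brier score approaches $0$ as the averaged prediction concentrates on the correct class, whereas the negative entropy approaches $0$ as it concentrates on any single class, since entropy does not use the label.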

Although a rigorous relationship between uncertainty and predictive accuracy is difficult to establish, empirical studies have reported a strong correlation between the two.
Theorem~\ref{thm::nb_consistency} lower-bounds the representation reliability measured by $\Perf{\cdot} \coloneqq -\Varr{}{\cdot}$, as approximated in Equation~\eqref{eq:rep_reli_var}, in terms of the reference point's reliability and its relative distance to the test point.
Combined with the empirical correlation between uncertainty and accuracy, this helps explain why the $\mathsf{NC}$ score can capture representation reliability across the performance metrics described above.
