\section{Additional Experimental Results} \label{app:suppl}

In the following appendices, we report averaged values together with the standard deviation arising from the random selection of the reference dataset.


\subsection{Ablation Studies} \label{app:abbl}

\paragraph{Unnormalized representation.} 

In the main manuscript, we normalize the representation provided by the embedding function before computing our $\mathsf{NC}$ score and the baselines: $h(\bx) / \lVert h(\bx) \rVert_2 \in \mathcal{S}^{d-1}$. 
Here we evaluate the performance of these methods using the original, unnormalized representations. 
For our $\mathsf{NC}_k$ and $\amk$, we use the Euclidean distance as the metric in the representation space. For $\mathsf{Norm}$, we directly compute the $L_2$-norm of the representations. For $\mathsf{FV}$, we calculate the variance of the representations in the unnormalized representation space. For $\mathsf{LL}$, we fit a Gaussian mixture model and compute the log-likelihood under this model. 
We reproduce the results from the main manuscript's Table~\ref{tab:comparison} (ID) and Table~\ref{tab:transfer} (Transfer). The results are shown in Table~\ref{tab:comparison_unnormalized} and Table~\ref{tab:transfer_unnormalized}, respectively.

As shown in Table~\ref{tab:comparison_unnormalized} (ID) and Table~\ref{tab:transfer_unnormalized} (Transfer), our method exhibits consistent and strong performance, even when applied directly to the unnormalized representation space. 
On the other hand, the baseline methods suffer a significant performance drop when paired with the Euclidean distance. 
Our conjecture is as follows. As noted in our results and in \cite{tack2020csi}, points with a larger $L_2$-norm tend to exhibit higher reliability.
This implies that reliable points are often located in the outer regions of the representation space, resulting in larger distances between points in those regions compared to points near the origin.
This observation, however, contradicts the assumption underlying $\amone$ and $\mathsf{FV}$ (i.e., that a smaller value indicates higher reliability). 
As a result, when these measures are applied to the unnormalized representations, they struggle to reflect the representation reliability accurately and may even exhibit negative correlations, contrary to their assumptions.
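
To make the scaling argument concrete, consider the following illustrative sketch; the symbols $u$, $v$, $r$, and $\theta$ are introduced here only for illustration and do not appear in the main manuscript. Let $u, v \in \mathcal{S}^{d-1}$ be two unit vectors separated by an angle $\theta \in [0, \pi]$, and let $r > 0$ be a common norm. Then
\begin{equation*}
\lVert r u - r v \rVert_2 = r \, \lVert u - v \rVert_2 = 2 r \sin(\theta / 2),
\end{equation*}
so that, for a fixed angular separation, the Euclidean distance between two points grows linearly with their norm. Under this simplified picture, a pair of points far from the origin is assigned a larger distance than an equally aligned pair near the origin, which is consistent with the conjecture above.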

\paragraph{Ensemble sizes ($M$) and the number of neighbors ($k$).}

In the main manuscript, we present an ablation study to investigate the impact of the ensemble size $M$ and the number of neighbors $k$ on estimating the representation reliability (see Figure~\ref{fig::abl_km}). Here we provide additional ablation results in Figure~\ref{fig:abl_m_full} (for the impact of $M$) and Figure~\ref{fig:abl_k_full} (for the impact of $k$). 
These results cover various combinations of pre-training algorithms, model architectures, and datasets.
The Brier score is used in all experimental results to measure downstream task performance. 
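
For reference, a common multi-class form of the Brier score is sketched below; the notation ($N$ evaluation points, $C$ classes, predicted probabilities $p_{i,c}$, and one-hot labels $y_{i,c}$) is assumed here for illustration, and the exact normalization used in our experiments may differ:
\begin{equation*}
\mathrm{BS} = \frac{1}{N} \sum_{i=1}^{N} \sum_{c=1}^{C} \left( p_{i,c} - y_{i,c} \right)^2 .
\end{equation*}
Under this convention, lower values indicate better downstream predictions.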



\subsection{Quantifying the Reliability of Individual Embedding Functions} \label{app:indiv}
To confirm whether the $\mathsf{NC}$ score can also be applied to gauge the reliability of an individual embedding function $h \in \mathcal{H}$, we measure, for each member of the ensemble, the correlation between each measure and the representation reliability estimated with the Brier score (i.e., downstream performance).
The results, detailed in Table~\ref{tab:single} and Table~\ref{tab:trasnfer_single}, include the average correlation and its standard deviation across the ensemble members.
Since metrics such as $\amone$, $\mathsf{Norm}$, and $\mathsf{LL}$ are inherently applicable to individual embedding functions without using the ensemble, we also report their non-ensemble scores, marked with a superscript star (*).











\input{tab_comparison_appendix}
\input{tab_transfer_appendix}
\input{fig_ablation}

\input{tab_single}
