\section{Background: Uncertainty Quantification in Supervised Learning} 
\label{sec:bg}




We recall the Bayesian view of epistemic uncertainty in supervised learning. Consider predictive models, where each model $f$ (e.g., a neural network) outputs a predictive distribution $p(y \mid \bx, f)$ for an input $\bx$. In the Bayesian framework, a prior distribution $p(f)$ over models is first introduced, and a posterior distribution is then inferred from a training dataset $\cD$: $p(f \mid \cD) \propto p(\cD \mid f) \cdot p(f)$. For a new test point $\bx^*$, the posterior predictive distribution is obtained by averaging the predictive probabilities over models, weighted by this posterior:
\begin{equation}
\label{eq::pred_post}
    p(\hat{y}^* \mid \bx^*, \cD) = \int p(\hat{y}^* \mid \bx^*, f) p(f \mid \cD) \dif f.
\end{equation}

Since the posterior distribution has no closed-form expression for complex models such as neural networks, Monte Carlo approaches can be used to approximate the integral in Equation~\eqref{eq::pred_post}.
One of the most prominent approaches is deep ensembles \citep{lakshminarayanan2017simple}, in which $M$ neural networks from a class $\mathcal{F}$ are trained from different random initializations of the learnable parameters, yielding $\{\theta_i\}_{i=1}^M$.
The posterior predictive distribution is then approximated by:
\begin{equation}
p(\hat{y}^* \mid \bx^*, \cD) \approx \frac{1}{M} \sum_{i=1}^{M} \big[ \delta(\hat{y}^* - f_{\theta_i}(\bx^*)) \big].
\end{equation}
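
As a brief remark (a restatement of the approximation above, with $g$ and $\bar{f}$ being notation introduced only here), the mixture of point masses implies that the expectation of any test function $g$ under the approximate predictive distribution reduces to an ensemble average:
\begin{equation*}
\int g(\hat{y}^*) \, p(\hat{y}^* \mid \bx^*, \cD) \dif \hat{y}^* \approx \frac{1}{M} \sum_{i=1}^{M} g\big(f_{\theta_i}(\bx^*)\big).
\end{equation*}
Taking $g$ to be the identity, for instance, gives the ensemble mean $\bar{f}(\bx^*) = \frac{1}{M} \sum_{i=1}^{M} f_{\theta_i}(\bx^*)$ as the point prediction.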

Finally, the uncertainty can be assessed by the \emph{inconsistency} of the output across sampled functions, which can be quantified by metrics such as variance\footnote{For a random vector, its variance is defined as the trace of its covariance matrix.} (for regression tasks) or entropy (for classification tasks).
For example, in regression tasks, the uncertainty can be estimated by
\begin{align}
    & \Unc{\bx^* \mid \cD} = \Var{\hat{y}^* \mid \bx^*, \cD} \label{eq:unc_sup} \\ 
    &\approx \Varr{i\sim[M]}{f_{\theta_i}(\bx^*)} 
    = \frac{1}{M^2} \sum_{i < j} \lVert f_{\theta_i}(\bx^*) - f_{\theta_j}(\bx^*) \rVert_2^2 \nonumber,
\end{align}
where $\Varr{i\sim[M]}{\cdot}$ is the population variance over the index $i \in \{1, \dots, M\}$.
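To see why the pairwise form holds, write $f_i = f_{\theta_i}(\bx^*)$ and $\bar{f} = \frac{1}{M} \sum_{i=1}^{M} f_i$ (shorthand introduced only for this derivation). Expanding the squared norms gives
\begin{align*}
\sum_{i < j} \lVert f_i - f_j \rVert_2^2
&= \frac{1}{2} \sum_{i=1}^{M} \sum_{j=1}^{M} \lVert f_i - f_j \rVert_2^2
= M \sum_{i=1}^{M} \lVert f_i \rVert_2^2 - \Big\lVert \sum_{i=1}^{M} f_i \Big\rVert_2^2 \\
&= M^2 \Big( \frac{1}{M} \sum_{i=1}^{M} \lVert f_i \rVert_2^2 - \lVert \bar{f} \rVert_2^2 \Big)
= M^2 \cdot \frac{1}{M} \sum_{i=1}^{M} \lVert f_i - \bar{f} \rVert_2^2,
\end{align*}
so dividing by $M^2$ recovers the population variance. As a sanity check, for $M = 2$ scalar predictions $1$ and $3$, the mean is $2$, the population variance is $\frac{1}{2}(1 + 1) = 1$, and the pairwise form gives $\frac{1}{4}(1 - 3)^2 = 1$.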
When different models disagree on their predictions, it suggests a high level of uncertainty. Conversely, if multiple predictive models give similar outputs, the prediction can be considered certain. 
However, this principle does not carry over to self-supervised learning: representations have no ground truth, and the representations produced by different pre-trained models can carry different semantic meanings, so disagreement between models need not indicate uncertainty.















