%\documentclass{uai2023} % for initial submission
\documentclass[accepted]{uai2023} % after acceptance, for a revised
                                    % version; also before submission to
                                    % see how the non-anonymous paper
                                    % would look like

%% There is a class option to choose the math font
% \documentclass[mathfont=ptmx]{uai2023} % ptmx math instead of Computer
% Modern (has noticable issues)
% \documentclass[mathfont=newtx]{uai2023} % newtx fonts (improves upon
 % ptmx; less tested, no support)
% NOTE: Only keep *one* line above as appropriate, as it will be replaced
%       automatically for papers to be published. Do not make any other
%       change above this note for an accepted version.

%% Choose your variant of English; be consistent
\usepackage[american]{babel}
% \usepackage[british]{babel}

%% Some suggested packages, as needed:
\usepackage{natbib} % has a nice set of citation styles and commands
    \bibliographystyle{plainnat}
    \renewcommand{\bibsection}{\subsubsection*{References}}
\usepackage{mathtools} % amsmath with fixes and additions
% \usepackage{siunitx} % for proper typesetting of numbers and units
\usepackage{booktabs} % commands to create good-looking tables
\usepackage{tikz} % nice language for creating drawings and diagrams

% for cross referencing the main text
% PLEASE ONLY USE xr IN THE SUPPLEMENTARY MATERIAL. 
% In the main paper, hard code any cross-reference to the supplementary material. 

% For theorems and such
\usepackage{amsmath}
\usepackage{amssymb}
\usepackage{amsthm}
\usepackage{multirow}
\usepackage{graphicx}
\usepackage{wrapfig}
\usepackage{float}
\floatstyle{plaintop}
\restylefloat{table}
\usepackage{url} 
\graphicspath{ {figure/} }
\usetikzlibrary{shapes,arrows}

\usepackage{amsfonts}
% algorithm
\usepackage{algorithm}
\usepackage{algpseudocode}
% if you use cleveref..
\usepackage[capitalize,noabbrev]{cleveref}
\usepackage{caption}
\usepackage{subcaption}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% THEOREMS
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\theoremstyle{plain}
\newtheorem{theorem}{Theorem}[section]
\newtheorem{proposition}[theorem]{Proposition}
\newtheorem{lemma}[theorem]{Lemma}
\newtheorem{corollary}[theorem]{Corollary}
\theoremstyle{definition}
\newtheorem{definition}[theorem]{Definition}
\newtheorem{assumption}[theorem]{Assumption}
\theoremstyle{remark}
\newtheorem{remark}[theorem]{Remark}
%% Provided macros
%  \smaller: Because the class footnote size is essentially LaTeX's \small,
%           redefining \footnotesize, we provide the original \footnotesize
%           using this macro.
%           (Use only sparingly, e.g., in drawings, as it is quite small.)

%% Self-defined macros
\newcommand{\swap}[3][-]{#3#1#2} % just an example

\title{Diversity-enhanced Probabilistic Ensemble For
Uncertainty Estimation\\(Supplementary Material)}

% The standard author block has changed for UAI 2023 to provide
% more space for long author lists and allow for complex affiliations
%
% All author information is automatically removed by the class for the
% anonymous submission version of your paper, so you can already add your
% information below.
%
% Add authors
\author[1]{\href{mailto:wangh36@rpi.edu}{Hanjing Wang}}
\author[1]{\href{mailto:jiq@rpi.edu}{Qiang Ji}}
% Add affiliations after the authors
\affil[1]{%
   ECSE\\
    Rensselaer Polytechnic Institute\\
    Troy, New York, USA
}
  
  \begin{document}
  
\onecolumn %% Remove this if a two-column layout is desired for the supplement
\maketitle


\appendix
\section{Preliminaries}

\subsection{Laplacian Approximation}

The Laplacian approximation (LA) approximates the true posterior over the parameters by a Gaussian distribution, i.e., $p(\mathbf{\theta}|\mathcal{D},\beta) \approx \mathcal{N}(\mathbf{\theta}_{map}, \Sigma)$,
where $\Sigma=-(H)^{-1}$ and $H=\nabla_{\mathbf{\theta}}^2 \log p(\mathbf{\theta}|\mathcal{D},\beta)|_{\mathbf{\theta} = \mathbf{\theta}_{map}}$. 

Computing the Hessian matrix $H$ efficiently and accurately is the key to LA. Given an isotropic zero-mean Gaussian prior $p(\mathbf{\theta}|\beta)=\mathcal{N}(\mathbf{0},\beta^2 I)$, where $\beta$ is a hyperparameter, we obtain
\begin{equation}
\label{LA_posterior}
    \begin{split}
        \nabla_{\mathbf{\theta}}^2 \log p(\mathbf{\theta}|\mathcal{D},\beta) &= \nabla_{\mathbf{\theta}}^2 \log p(\mathcal{D}|\mathbf{\theta}) + \nabla_{\mathbf{\theta}}^2 \log p(\mathbf{\theta}|\beta)\\
        &= \sum_{(x,y)\in \mathcal{D}} \nabla_{\mathbf{\theta}}^2 \log p(y|x,\mathbf{\theta}) -\frac{1}{\beta^2} I 
    \end{split}
\end{equation}
where $I$ is the identity matrix. Computing exact second-order derivatives of highly nonlinear neural networks is expensive, so we use the generalized Gauss-Newton (GGN) matrix \citep{schraudolph2002fast} to approximate $\nabla_{\mathbf{\theta}}^2 \log p(y|x,\mathbf{\theta})$. Denoting the neural network output by $f(x, \mathbf{\theta})$, we have
\begin{equation}
\label{GGN}
\begin{split}
    \nabla_{\mathbf{\theta}}^2 \log p(y|x,\mathbf{\theta}) &= \nabla_{\mathbf{\theta}}^2 \log p(y|f(x,\mathbf{\theta}))\\
    &\approx J(x) \nabla_{f}^2 \log p(y|f(x,\mathbf{\theta})) J(x)^T
\end{split}
\end{equation}
where $J(x)=\nabla_{\mathbf{\theta}} f(x,\mathbf{\theta})$ is the Jacobian matrix. However, the large matrix products in Eq.~\eqref{GGN} remain costly for deep models with many parameters. We therefore use the last-layer Laplacian approximation proposed by \citet{kristiadi2020being}, which constructs the posterior approximation only over the last-layer weights of the network to reduce computational complexity. We use the full Hessian matrix without additional factorization assumptions. To avoid tuning the hyperparameter $\beta$ by hand, we use the marginal likelihood maximization method proposed by \citet{Ritter_ICLR18_laplace} to perform a one-parameter optimization over $\beta$. The objective is the log posterior predictive approximated by LA:
\begin{equation}
    \beta^* = \arg \max_{\beta} \sum_{(x,y)\in \mathcal{D}} \log p(y|x,\mathcal{D},\beta)
\end{equation}
After computing the Laplacian approximation $\mathcal{N}(\mathbf{\theta}_{map}, \Sigma)$, we can perform Bayesian inference as in Eq.~\eqref{la_inference}. Given a new input-label pair $(x^*,y^*)$,
\begin{equation}
\label{la_inference}
 p(y^*|x^*,\mathcal{D}) = \int p(y^*|x^*,\mathbf{\theta})p(\theta|\mathcal{D},\beta)d\mathbf{\theta} 
 %& \approx \int p(y^*|x^*,\theta)\mathcal{N}(\theta;\theta_{map}, \Sigma)d\theta   \\
 \approx \int \operatorname{softmax}(f(x^*,\mathbf{\theta})) \mathcal{N}(\mathbf{\theta};\mathbf{\theta}_{map}, \Sigma)d\mathbf{\theta}  
\end{equation}
where $\operatorname{softmax}(f)=\frac{\exp(f)}{\sum_j \exp(f_j)}$ is the softmax function. Eq.~\eqref{la_inference} can be evaluated either by Monte Carlo (MC) averaging or by the probit approximation. A first-order Taylor expansion of $f(x,\mathbf{\theta})$ with respect to $\mathbf{\theta}$ at $\mathbf{\theta}_{map}$ yields $f(x,\mathbf{\theta})\approx f(x,\mathbf{\theta}_{map})+J(x)^T(\mathbf{\theta} - \mathbf{\theta}_{map})$, which implies that $f(x,\mathbf{\theta})$ is approximately distributed as $\mathcal{N}(f(x,\mathbf{\theta}_{map}),\Sigma^f)$, where $\Sigma^f=J(x)^T\Sigma J(x)\in \mathcal{R}^{C\times C}$ and $J(x)$ is evaluated at $\mathbf{\theta}_{map}$. Based on the probit approximation,
\begin{equation}
    \label{probit_approximation}
     p(y^*=c|x^*,\mathcal{D}) \approx \frac{\exp(\tau^{(c)}(x^*))}{\sum_j \exp(\tau^{(j)}(x^*)) } ~~\text{where}~~ \tau^{(j)}(x^*)=\frac{f^{(j)}(x^*,\mathbf{\theta}_{map})}{\sqrt{1+\frac{\pi}{8} \Sigma^f_{jj}}}
\end{equation}
where $f^{(j)}(x^*,\mathbf{\theta}_{map})\in \mathcal{R}$ is the $j$th element of $f(x^*,\mathbf{\theta}_{map})$ and $\Sigma^f_{jj}$ is the $(j,j)$th element of $\Sigma^f$ evaluated at $x^*$.
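
As an illustrative sketch of Eq.~\eqref{probit_approximation} (the numbers are hypothetical and chosen only for easy arithmetic), consider $C=2$ classes whose logits at the MAP estimate are $(2,0)^T$ and whose output variances are $\Sigma^f_{11}=\Sigma^f_{22}=24/\pi$. Then
\[
    \sqrt{1+\tfrac{\pi}{8}\cdot\tfrac{24}{\pi}}=2, \qquad \tau=(1,0)^T, \qquad p(y^*=1|x^*,\mathcal{D})\approx\frac{e}{e+1}\approx 0.731,
\]
whereas the softmax of the MAP logits alone would give $e^{2}/(e^{2}+1)\approx 0.881$. A larger output variance therefore shrinks the logits and makes the prediction less confident.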

\subsection{Uncertainty Quantification} 

For classification problems, we estimate the epistemic uncertainty and the aleatoric uncertainty by the mutual information and the expected entropy, respectively \citep{depeweg2018decomposition}. 

\begin{equation}\label{eq:total_entropy}
  \underbrace{\mathcal{H}\left[\mathrm{p}(y | x, \mathcal{D}, \beta)\right]}_{\text {Total Uncertainty }}
  =\underbrace{\mathcal{I}\left[y, \theta | x,\mathcal{D}, \beta\right]}_{\text {Epistemic Uncertainty }}
  +\underbrace{\mathbb{E}_{\mathrm{p}\left({\theta} | \mathcal{D}, \beta\right)}\big[\mathcal{H}[\mathrm{p}(y | x, \theta)]\big]}_{\text {Aleatoric Uncertainty}}
\end{equation}
where $\mathcal{H}$ and $\mathcal{I}$ represent the entropy and mutual information, respectively. More specifically,
\begin{equation}
    \label{total uncertainty}
    \begin{split}
        &\mathcal{H}\left[p(y | x, \mathcal{D}, \beta)\right]=\mathcal{H}\left[\mathbb{E}_{p(\theta|\mathcal{D},\beta)}[p(y| x, \theta)]\right] 
        \approx  \mathcal{H}\left[\frac{1}{S} \sum_{s=1}^S p(y|x,\theta^s)\right]\\
        &\mathbb{E}_{p\left({\theta} | \mathcal{D}, \beta\right)}\big[\mathcal{H}[p(y | x, \theta)]\big]\approx \frac{1}{S} \sum_{s=1}^S \mathcal{H}(p(y|x,\theta^s))  
    \end{split}
\end{equation}
where $\theta^s \sim p\left({\theta} | \mathcal{D}, \beta\right) \approx \sum_{i=1}^N \lambda_i \mathcal{N}(\theta;\theta_{i},\Sigma_i)$ for the probabilistic ensemble model.
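
As a small worked illustration of Eq.~\eqref{total uncertainty} (hypothetical numbers, natural logarithm), take a binary problem with $S=2$ samples whose predictions are $p(y|x,\theta^1)=(0.9,0.1)$ and $p(y|x,\theta^2)=(0.1,0.9)$. The averaged prediction is $(0.5,0.5)$, so
\[
    \mathcal{H}\big[(0.5,0.5)\big]=\log 2\approx 0.693, \qquad \frac{1}{2}\sum_{s=1}^{2}\mathcal{H}\big[p(y|x,\theta^s)\big]=-0.9\log 0.9-0.1\log 0.1\approx 0.325,
\]
and the epistemic uncertainty is approximately $0.693-0.325=0.368$. Each sample is individually confident, yet the samples disagree, so a sizeable share of the total uncertainty is epistemic.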

\section{Probabilistic Ensemble Propositions}

\subsection{Proof of Proposition 3.1}
\begin{proof}
We first introduce the Bernstein-von Mises theorem.
\begin{lemma}[Bernstein-von Mises theorem for Laplacian approximation of the posterior distribution \citep{kleijn2012bernstein,gelman2011induction}]\label{Bernstein-von Mises}
Under mild regularity conditions (e.g., the likelihood function is continuous in $\theta$ and $\sum_{i=1}^N \lambda_i \theta_i$ is not on the boundary of the parameter space), as the sample size $M\rightarrow \infty$, the posterior distribution of $\theta$ approaches its Laplacian approximation $\mathcal{N}(\theta;\theta_{map},\Sigma)$. Specifically,
\begin{equation}
    \label{Bernstein-von Mises equation}
    \sup_{\theta} |p(\theta|\mathcal{D},\beta) - \mathcal{N}(\theta;\theta_{map}, \Sigma)| \rightarrow 0
\end{equation}
\end{lemma}
Then, for the probabilistic ensemble model with $\theta \sim \sum_{i=1}^N \lambda_i \mathcal{N}(\theta;\theta_{i}, \Sigma_i)$, since $\sum_{i=1}^N \lambda_i = 1$ and the lemma applies to each component, we have
\begin{equation}
    \label{proof_convergence}
    \begin{split}
        &\sup_{\theta} \left|p(\theta|\mathcal{D},\beta) - \sum_{i=1}^N \lambda_i \mathcal{N}(\theta;\theta_{i}, \Sigma_i)\right| =  \sup_{\theta} \left| \sum_{i=1}^N \lambda_i [p(\theta|\mathcal{D},\beta) -\mathcal{N}(\theta;\theta_{i}, \Sigma_i)]\right| \\
        & \leq \sum_{i=1}^N \lambda_i \sup_{\theta}\left|p(\theta|\mathcal{D},\beta) -\mathcal{N}(\theta;\theta_{i}, \Sigma_i)\right| \rightarrow 0
    \end{split}
\end{equation}
\end{proof} 

\subsection{Proof of Proposition 3.2}
\begin{proof}
The proposed probabilistic ensemble can be viewed as an approximate Bayesian method in which the Laplacian approximation bridges randomization-based ensembles and the Bayesian posterior distribution. Given a set of coefficients $\{\lambda_i\}_{i=1}^N$ where $\lambda_i>0$ and $\sum_{i=1}^N \lambda_i=1$,
\begin{equation}
\label{appro_Bayesian_PE}
    p(\theta|\mathcal{D},\beta) = \sum_{i=1}^N \lambda_i p(\theta|\mathcal{D},\beta) \approx   \sum_{i=1}^N \lambda_i \mathcal{N}(\theta;\theta_{i}, \Sigma_i)
\end{equation}
Eq.~\eqref{appro_Bayesian_PE} holds since the Laplacian approximation $\mathcal{N}(\theta;\theta_{i}, \Sigma_i)$ of the $i$th ensemble model independently serves as an approximation of $p(\theta|\mathcal{D},\beta)$. Instead of treating the deep ensemble method as non-Bayesian, we argue that it is necessary to relate deep ensembles to the parameter posterior. The vanilla posterior approximation of the deep ensemble method can be expressed as $p_{DE}(\theta) = \sum_{i=1}^N \lambda_i \delta(\theta,\theta_i)$, where $\delta(\theta,\theta_i)$ is the delta function that returns 1 if and only if $\theta = \theta_i$ and returns 0 otherwise. Since $p_{DE}(\theta)$ is a discrete distribution, there can be a large gap between $p_{DE}(\theta)$ and $p(\theta|\mathcal{D},\beta)$ when $\theta \not\in \{\theta_i\}_{i=1}^N$. For example, the KL divergence between $p(\theta|\mathcal{D},\beta)$ and $p_{DE}(\theta)$ is shown in Eq.~\eqref{KL_DE}.
\begin{equation}
\begin{split}
    \label{KL_DE}
    KL(p(\theta|\mathcal{D},\beta)||\sum_{i=1}^N \lambda_i \delta(\theta,\theta_i))=  -\mathcal{H}(p(\theta|\mathcal{D},\beta)) - \int p(\theta|\mathcal{D},\beta) \log \sum_{i=1}^N \lambda_i \delta(\theta,\theta_i) d \theta
\end{split}
\end{equation}
We can observe that $KL(p(\theta|\mathcal{D},\beta)||\sum_{i=1}^N \lambda_i \delta(\theta,\theta_i))$ can be extremely large since $\log \sum_{i=1}^N \lambda_i \delta(\theta,\theta_i) \rightarrow -\infty$ when $\theta \not\in \{\theta_i\}_{i=1}^N$. This is mainly because the vanilla approximation does not explore parameter values other than the modes. Given a limited number of modes, $p_{DE}(\theta)$ can be used for Bayesian prediction but can hardly capture the complex posterior distribution. In contrast, the PE model extends the deep ensemble method to approximate Bayesian inference by exploring the subspace around each ensemble member, enabling a better posterior approximation.

Next, we show that the KL divergence between $p(\theta|\mathcal{D},\beta)$ and $\sum_{i=1}^N \lambda_i \mathcal{N}(\theta;\theta_{i}, \Sigma_i)$ is no larger than the $\lambda$-weighted average of the KL divergences of the single-network LAs. Based on Jensen's inequality and the convexity of $-\log$, we have
\begin{equation}
    \label{KL_pe_posterior}
    \begin{split}
        KL(p(\theta|\mathcal{D},\beta)||\sum_{i=1}^N \lambda_i \mathcal{N}(\theta;\theta_{i}, \Sigma_i)) &= -\mathcal{H}(p(\theta|\mathcal{D},\beta)) - \int p(\theta|\mathcal{D},\beta) \log \sum_{i=1}^N \lambda_i \mathcal{N}(\theta;\theta_{i}, \Sigma_i) d\theta \\
        &\leq -\mathcal{H}(p(\theta|\mathcal{D},\beta)) -\sum_{i=1}^N \lambda_i\int p(\theta|\mathcal{D},\beta) \log \mathcal{N}(\theta;\theta_{i}, \Sigma_i) d\theta \\
        & = \sum_{i=1}^N \lambda_i KL(p(\theta|\mathcal{D},\beta)||\mathcal{N}(\theta;\theta_{i}, \Sigma_i))
    \end{split}
\end{equation}
\end{proof}


\subsection{Proof of Proposition 3.3}
\begin{proof}
We first introduce a lemma based on \citet{liao2018sharpening}.
\begin{lemma}[Sharpening Jensen's inequality \citep{liao2018sharpening}]\label{lemma1}
Consider a convex function $\phi(\cdot)$ and a scalar random variable $z\in [a,b]$, where $a$ and $b$ are finite real numbers. Let $\mu = \mathbb{E}[z]$ and denote by $r(z)$ the residual term of the first-order Taylor expansion of $\phi(z)$ at $\mu$, i.e.,
\begin{equation}
    \phi(z)=\phi(\mu) +\phi'(\mu)(z-\mu) +r(z)
\end{equation}
Then there exists a finite number $C_{min}$ such that 
\begin{equation}
    \mathbb{E}[\phi(z)]-\phi(\mathbb{E}[z])\geq C_{min}V(z)
\end{equation}
where $V(z)$ is the variance of $z$. In particular, $C_{min}\geq \inf_{z\in[a,b]}\frac{\phi''(z)}{2}$ implies
\begin{equation}
    \mathbb{E}[\phi(z)]-\phi(\mathbb{E}[z])\geq \inf_{z\in[a,b]}\frac{\phi''(z)}{2}V(z)
\end{equation}
\end{lemma}
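A brief sketch of why such a constant exists, under the additional assumption (ours, for illustration) that $\phi$ is twice differentiable on $[a,b]$: taking expectations of the Taylor expansion in the lemma and using $\mathbb{E}[z-\mu]=0$ gives
\[
    \mathbb{E}[\phi(z)]-\phi(\mu)=\mathbb{E}[r(z)]=\mathbb{E}\left[\frac{r(z)}{(z-\mu)^2}(z-\mu)^2\right]\geq \inf_{z\in[a,b],\,z\neq\mu}\frac{r(z)}{(z-\mu)^2}\,V(z).
\]
By Taylor's theorem with the Lagrange remainder, $r(z)=\frac{1}{2}\phi''(\xi)(z-\mu)^2$ for some $\xi$ between $\mu$ and $z$, so the infimum on the right-hand side is at least $\inf_{z\in[a,b]}\frac{\phi''(z)}{2}$.
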
For Proposition 3.3, let $\phi(z)=-\log z$, which is a convex function. Given an input $x$ and the ground-truth label $y\in \{1,2,...,C\}$, let $z=p(y|x,\theta)\in [0,1]$, where $\theta$ denotes the random parameters of the probabilistic ensemble, which follow a mixture of Gaussians. Since $\phi''(z)=1/z^2$, Lemma~\ref{lemma1} gives the following lower bound on the Jensen gap.
\begin{equation} \label{proof_theorem1}
         \mathbb{E}_{\theta}[- \log p(y|x,\theta)]-[-\log \mathbb{E}_{\theta}
    [p(y|x,\theta)]] \geq \inf_{\theta}\frac{1}{2p(y|x,\theta)^2} \mathbb{V}_{\theta}[p(y|x,\theta)]
\end{equation}
\end{proof}
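
As a quick numerical check of Eq.~\eqref{proof_theorem1} (a toy example with hypothetical values), suppose $p(y|x,\theta)$ equals $0.5$ or $1$ with probability $\frac{1}{2}$ each. Then
\[
    \mathbb{E}_{\theta}[-\log p(y|x,\theta)]=\tfrac{1}{2}\log 2\approx 0.347, \qquad -\log \mathbb{E}_{\theta}[p(y|x,\theta)]=-\log 0.75\approx 0.288,
\]
so the Jensen gap is about $0.059$. The variance is $0.0625$ and $\inf_{\theta}\frac{1}{2p(y|x,\theta)^2}=\frac{1}{2}$, which gives the lower bound $0.03125\leq 0.059$.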



\subsection{Proof of Proposition 3.4}

\begin{proof}
First, the vanilla posterior approximation of the deep ensemble method can be expressed as $p_{DE}(\theta) = \sum_{i=1}^N \lambda_i \delta(\theta,\theta_i)$, where $\delta(\theta,\theta_i)$ is the delta function that returns 1 if and only if $\theta = \theta_i$ and returns 0 otherwise. The mean and covariance under $p_{DE}(\theta)$ are shown in Eq.~\eqref{mean_varaince_DE}.
\begin{equation}\label{mean_varaince_DE}
\begin{split}
    & \mu_{D}=\mathbb{E}_{\theta \sim p_{DE}(\theta)}[\theta] = \sum_{i=1}^N \lambda_i \theta_i \\
    & \Sigma_{D}=Cov_{\theta \sim p_{DE}(\theta)}[\theta] = \mathbb{E}_{\theta \sim p_{DE}(\theta)}[\theta \theta^T] - \mu_{D}\mu_{D}^T = \sum_{i=1}^N \lambda_i \theta_i\theta_i^T - \mu_{D} \mu_{D}^T
\end{split}
\end{equation}
For the probabilistic ensemble $\theta \sim \sum_{i=1}^N \lambda_i \mathcal{N}(\theta;\theta_{i}, \Sigma_i)$, we have that 
\begin{equation}\label{mean_varaince_PE}
\begin{split}
    & \mu_{P}=\mathbb{E}_{\theta \sim p_{PE}(\theta)}[\theta] = \sum_{i=1}^N \lambda_i \theta_i = \mu_{D} \\
    & \Sigma_{P}=Cov_{\theta \sim p_{PE}(\theta)}[\theta] = \mathbb{E}_{\theta \sim p_{PE}(\theta)}[\theta \theta^T] - \mu_{P}\mu_{P}^T  = \sum_{i=1}^N \lambda_i \mathbb{E}_{\theta \sim \mathcal{N}(\theta;\theta_{i}, \Sigma_i)}[\theta\theta^T] - \mu_{P} \mu_{P}^T\\
    &~~~~~=\sum_{i=1}^N \lambda_i (\theta_i \theta_i^T+ \Sigma_i)-\mu_{P} \mu_{P}^T = \Sigma_{D}+\sum_{i=1}^N \lambda_i \Sigma_i \geq \Sigma_{D}
\end{split}
\end{equation}
where $\Sigma_{P}\geq \Sigma_{D}$ means that $\Sigma_{P}- \Sigma_{D}$ is positive semi-definite. Eq.~\eqref{mean_varaince_PE} shows that the probabilistic ensemble model has the same mean as the deep ensemble but a larger parameter covariance, i.e., better diversity in terms of variance. 
\end{proof}
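
For intuition, consider a scalar sketch of Eqs.~\eqref{mean_varaince_DE} and \eqref{mean_varaince_PE} with hypothetical values $N=2$, $\lambda_1=\lambda_2=\frac{1}{2}$, $\theta_1=-1$, $\theta_2=1$, and $\Sigma_1=\Sigma_2=0.25$:
\[
    \mu_{D}=\mu_{P}=0, \qquad \Sigma_{D}=\tfrac{1}{2}(1)+\tfrac{1}{2}(1)-0=1, \qquad \Sigma_{P}=\Sigma_{D}+\tfrac{1}{2}(0.25)+\tfrac{1}{2}(0.25)=1.25.
\]
The mean is unchanged, while the variance grows by the weighted average of the component covariances.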

\subsection{Proof of Proposition 3.5}
\begin{proof}
We first introduce three lemmas.
\begin{lemma}[From \citet{hein2019relu}; also stated as Lemma A.1 in \citet{kristiadi2020being}]\label{lemma_35_1}  Let $\{Q_i\}_{i=1}^R$ be the set of linear regions associated with the ReLU network $f: \mathcal{R}^{|x|} \rightarrow \mathcal{R}^C$. For any $x\in \mathcal{R}^{|x|}$, there exist an $\alpha>0$ and a $t\in\{1,2,...,R\}$ such that $\delta x \in Q_t$ for all $\delta \geq \alpha$. Furthermore, the restriction of $f$ to $Q_t$ can be written as an affine function $W^T x + q$ for some suitable $W \in \mathcal{R}^{|x| \times C}$ and $q \in \mathcal{R}^C$.
\end{lemma}
\begin{lemma}[From Lemma A.2 in \citep{kristiadi2020being}]\label{lemma_35_2}  
Let $A\in \mathcal{R}^{d_1 \times d_2}$ and $z \in \mathcal{R}^{d_2}$ with $d_1 \geq d_2$. Then $||Az||^2 \geq s_{min}^2(A) ||z||^2$, where $s_{min}(A)$ is the minimum singular value of $A$.
\end{lemma}
\begin{lemma}[From Lemma A.3 in \citep{kristiadi2020being}]\label{lemma_35_4}  
Let $A\in \mathcal{R}^{d \times d}$ be a symmetric positive-definite (SPD) matrix and $z \in \mathcal{R}^{d}$. Then $z^T A z \geq \lambda_{min}(A) ||z||^2$, where $\lambda_{min}(A)$ is the minimum eigenvalue of $A$.
\end{lemma}
Before proving Proposition 3.5, we use the above three lemmas to prove Lemma \ref{lemma_35_3} first.
\begin{lemma}\label{lemma_35_3} Let $f_{\theta}: \mathcal{R}^{|x|}\rightarrow \mathcal{R}^C$ be a ReLU network for multi-class classification parameterized by $\theta$, where $|x|$ denotes the dimension of $x$, and let $\theta \sim \mathcal{N}(\theta;\mu, \Sigma)$ be given by LA. Then for any input $x$, the scaled logits $\tau^{(c)}$ of the multi-class probit approximation in Eq.~\eqref{probit_approximation} satisfy 
\begin{equation}
    \lim_{\delta \rightarrow \infty} |\tau^{(c)}(\delta x)| \leq \frac{||w^{(c)}||}{s_{min}(J^{(c)})\sqrt{\frac{\pi}{8}\lambda_{min}(\Sigma)}} \quad \quad c=1,2,\cdots,C
\end{equation}
where $w = [w^{(1)},w^{(2)},\cdots,w^{(C)}]\in \mathcal{R}^{|x|\times C}$ is a matrix that only depends on $\mu$. $J^{(j)} = \frac{\partial w^{(j)}}{\partial \theta}|_{\theta = \mu}$ is the Jacobian matrix of $w^{(j)}$ with respect to $\theta$ at $\theta = \mu$. $\lambda_{min}(\Sigma)$ is the minimum eigenvalue while $s_{min}$ represents the minimum singular value.
\end{lemma}
\begin{proof}
We follow the proof of Theorem 2.3 of \citet{kristiadi2020being}, who focus on binary classification, and extend it to the multi-class case. 

By Lemma~\ref{lemma_35_1}, there exist $\alpha> 0$ and a linear region $R$ such that $\delta x \in R$ for all $\delta \geq \alpha$. The restriction $f_{\theta} |_R$ can be expressed as $f_\theta |_R (x) = w^T x + q$, where $w = [w^{(1)},w^{(2)},\cdots,w^{(C)}]\in \mathcal{R}^{|x|\times C}$ and $q \in \mathcal{R}^{C}$. Both $w$ and $q$ are constant with respect to $\delta x$ and depend only on $\mu$. Let $f_\theta(\delta x) = [f^{(1)}_\theta(\delta x),f^{(2)}_\theta(\delta x),...,f^{(C)}_\theta(\delta x)]^T$ and $q = [q^{(1)}, q^{(2)},...,q^{(C)}]^T$. The gradient of $f_\theta^{(c)}(\delta x)$ ($c=1,2,...,C$) with respect to $\theta$ at $\theta = \mu$ can be expressed as
\begin{equation}
    d_c(\delta x) =  \frac{\partial (\delta w^{(c)^T} x + q^{(c)})}{\partial \theta}|_{\mu} = \delta (\frac{\partial w^{(c)}}{\partial \theta}|_\mu^T x + \frac{1}{\delta} \frac{\partial q^{(c)}}{\partial \theta}|_\mu) := \delta (J^{(c)^T}x + \frac{1}{\delta} \nabla_\theta q^{(c)}|_\mu)
\end{equation}
Then based on the multi-class probit approximation shown in Eq.~\eqref{probit_approximation}, we have 
\begin{equation}\label{eq:tau}
    \begin{split}
        |\tau^{(c)}(\delta x)| &= \frac{|\delta w^{(c)^T} x + q^{(c)}|}{\sqrt{1+\frac{\pi}{8} d_{c}(\delta x)^T \Sigma d_{c}(\delta x)}} \\
        & = \frac{|\delta (w^{(c)^T} x + \frac{1}{\delta} q^{(c)})|}{\sqrt{1+\frac{\pi}{8}\delta^2(J^{(c)^T}x + \frac{1}{\delta} \nabla_\theta q^{(c)}|_\mu)^T\Sigma (J^{(c)^T}x + \frac{1}{\delta} \nabla_\theta q^{(c)}|_\mu)}} \\
        & = \frac{|w^{(c)^T} x + \frac{1}{\delta} q^{(c)}|}{\sqrt{\frac{1}{\delta^2}+\frac{\pi}{8} (J^{(c)^T}x + \frac{1}{\delta} \nabla_\theta q^{(c)}|_\mu)^T\Sigma (J^{(c)^T}x + \frac{1}{\delta} \nabla_\theta q^{(c)}|_\mu)}}
    \end{split}
\end{equation}
When $\delta \rightarrow \infty$, Eq.~\eqref{eq:tau} becomes
\begin{equation}
    \lim_{\delta \rightarrow \infty} |\tau^{(c)}(\delta x)| = \frac{|w^{(c)^T} x |}{\sqrt{\frac{\pi}{8} (J^{(c)^T}x)^T \Sigma (J^{(c)^T}x) }} 
\end{equation}
Then, using Lemmas~\ref{lemma_35_2} and \ref{lemma_35_4} together with the Cauchy-Schwarz inequality, and noting that $s_{min}(J^{(c)})=s_{min}(J^{(c)^T})$, we have 
\begin{equation}
\begin{split}
       \lim_{\delta \rightarrow \infty} |\tau^{(c)}(\delta x)| &= \frac{|w^{(c)^T} x |}{\sqrt{\frac{\pi}{8} (J^{(c)^T}x)^T \Sigma (J^{(c)^T}x) }}  \\
       & \leq \frac{||w^{(c)}||~ ||x|| }{\sqrt{\frac{\pi}{8}\lambda_{min}(\Sigma)||J^{(c)^T}x||^2}} \\
       &\leq \frac{||w^{(c)}||~ ||x|| }{\sqrt{\frac{\pi}{8}\lambda_{min}(\Sigma)s_{min}^2(J^{(c)^T})||x||^2}} =  \frac{||w^{(c)}||}{s_{min}(J^{(c)})\sqrt{\frac{\pi}{8}\lambda_{min}(\Sigma)}}
\end{split}
\end{equation}
\end{proof}
Given a probabilistic ensemble model with $N$ components, let $f_{\theta_i}: \mathcal{R}^{|x|}\rightarrow \mathcal{R}^C$ be a ReLU network for multi-class classification parameterized by $\theta_i$ ($i=1,2,...,N$). For the probabilistic ensemble model, we have $\theta \sim \sum_{i=1}^N \lambda_i \mathcal{N}(\theta;\theta_{i}, \Sigma_i)$. Based on Eq.~\eqref{probit_approximation} and Lemma~\ref{lemma_35_3}, the following property holds for a single model $f_{\theta_i}$.
\begin{equation}
    \begin{split}
            \lim_{\delta \rightarrow \infty} p_i(y=c|\delta x,\mathcal{D}) & = \frac{\exp(\tau^{(c)}_i(\delta x))}{\sum_{j=1}^C \exp(\tau^{(j)}_i(\delta x))} = \frac{1}{1+\sum_{j\neq c} \exp(\tau^{(j)}_i(\delta x)-\tau^{(c)}_i(\delta x))}\\
            & \leq \frac{1}{1+\sum_{j\neq c} \exp(-|\tau^{(j)}_i(\delta x)|-|\tau^{(c)}_i(\delta x)|)} \\
            & \leq \frac{1}{1+\sum_{j\neq c} \exp \left\{ -\frac{||w_i^{(j)}||}{s_{min}(J_{i}^{(j)})\sqrt{\frac{\pi}{8}\lambda_{min}(\Sigma_i)}}-\frac{||w_i^{(c)}||}{s_{min}(J_{i}^{(c)})\sqrt{\frac{\pi}{8}\lambda_{min}(\Sigma_i)}}\right\}}
    \end{split}
\end{equation}
where $w_i = [w_i^{(1)},w_i^{(2)},\cdots,w_i^{(C)}]\in \mathcal{R}^{|x|\times C}$ is a matrix that only depends on $\theta_i$. $J_i^{(j)} = \frac{\partial w_i^{(j)}}{\partial \theta}|_{\theta = \theta_i}$ is the Jacobian matrix of $w_i^{(j)}$ with respect to $\theta$ at $\theta = \theta_i$.

Then, for the probabilistic ensemble $\theta \sim \sum_{i=1}^N \lambda_i \mathcal{N}(\theta;\theta_{i}, \Sigma_i)$, we have
\begin{equation}
    \begin{split}
        \lim_{\delta \rightarrow \infty} p_{PE}(y=c|\delta x,\mathcal{D}) &= \lim_{\delta \rightarrow \infty} \int p(y=c|\delta x,\theta) \sum_{i=1}^N \lambda_i \mathcal{N}(\theta;\theta_{i}, \Sigma_i) d\theta \\
        & = \lim_{\delta \rightarrow \infty} \sum_{i=1}^N \lambda_i \int p(y=c|\delta x,\theta)  \mathcal{N}(\theta;\theta_{i}, \Sigma_i) d\theta \\
        & = \lim_{\delta \rightarrow \infty} \sum_{i=1}^N \lambda_i p_i(y=c|\delta x,\mathcal{D}) \\
        & \leq  \sum_{i=1}^N  \frac{\lambda_i}{1+\sum_{j\neq c} \exp \left\{ -\frac{||w_i^{(j)}||}{s_{min}(J_{i}^{(j)})\sqrt{\frac{\pi}{8}\lambda_{min}(\Sigma_i)}}-\frac{||w_i^{(c)}||}{s_{min}(J_{i}^{(c)})\sqrt{\frac{\pi}{8}\lambda_{min}(\Sigma_i)}}\right\}}
    \end{split}
\end{equation}
Letting
\[
    t_i^{(k)} = \frac{||w_i^{(k)}||}{s_{min}(J_{i}^{(k)})\sqrt{\frac{\pi}{8}\lambda_{min}(\Sigma_i)}}, \quad k = 1,2,\cdots,C,
\]
we have 
\begin{equation}
    \lim_{\delta \rightarrow \infty} p_{PE}(y=c|\delta x,\mathcal{D}) \leq \sum_{i=1}^N  \frac{\lambda_i}{1+\sum_{j\neq c} \exp \{ -t_i^{(j)}-t_i^{(c)}\}}
\end{equation}

\end{proof}
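
To illustrate the bound numerically (a sketch with hypothetical values, not taken from the experiments), suppose all components share $t_i^{(k)}=1$ and $C=10$. Since $\sum_{i=1}^N\lambda_i=1$, the right-hand side becomes
\[
    \frac{1}{1+(C-1)\exp(-2)}=\frac{1}{1+9\,e^{-2}}\approx 0.451,
\]
so the asymptotic confidence far away from the training data stays bounded away from $1$.
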
\section{Adaptive Uncertainty-guided Ensemble Learning Proposition}

\subsection{Proof of Proposition 3.6}
\begin{proof}
Let $f$ be any classifier with input $x$, and denote by $y$ the corresponding label. Following \citet{hellman1970probability}, we have
\begin{equation}
    \label{prediction_error}
    Pr(y\neq f(x)) \leq \frac{\mathcal{H}(y)-MI(x,y)}{2} = \frac{1}{2} \mathcal{H}(y|x)
\end{equation}
where $MI(x,y)$ is the mutual information between $x$ and $y$. Note that $\mathcal{H}(y|x)=\mathcal{H}[\mathbb{E}_{\theta}[p(y|x,\theta)]]$ is the total uncertainty, which is positively correlated with the prediction error through this bound. This indicates that minimizing the total uncertainty leads to a tighter prediction error bound. Since the total uncertainty is the sum of the epistemic uncertainty and the irreducible aleatoric uncertainty, the epistemic uncertainty is also positively correlated with the prediction error. 
\end{proof}

\section{Derivations for MoG Refinement}
In this section, we provide detailed derivations for the E-step and M-step.

E-step: construct the expected objective with respect to the latent variable $Z$. Based on Eq.~(13) of the main body of the paper, we have
\begin{equation}\label{detailed_e_step}
    \begin{split}
    Q(\phi|\phi^0,\mathcal{D}) &= \sum_{m=1}^M \sum_{i=1}^N p(Z=i|\mathcal{D}_m,\phi^0) \log \frac{p(\mathcal{D}_m,Z=i|\phi)}{p(Z=i|\mathcal{D}_m,\phi^0)} \\
    &=  \sum_{m=1}^M \sum_{i=1}^N p(Z=i|\mathcal{D}_m,\phi^0) \log p(\mathcal{D}_m,Z=i|\phi) + \mathrm{const} \\
    &= \sum_{m=1}^M \sum_{i=1}^N p(Z=i|\mathcal{D}_m,\phi^0) [\log p(\mathcal{D}_m|Z=i,\phi)+\log p(Z=i|\phi)] + \mathrm{const} \\
    &= \sum_{m=1}^M \sum_{i=1}^N p(Z=i|\mathcal{D}_m,\phi^0) [\log p(y_m|x_m, \theta_i,\Sigma_i)+\log \lambda_i] + \mathrm{const}
\end{split}
\end{equation}
where $\mathrm{const}$ collects the terms that do not depend on $\phi$ and can be dropped when maximizing over $\phi$, and
\begin{equation}
    \begin{split}
        p(Z=i|\mathcal{D}_m,\phi^0) &= \frac{p(\mathcal{D}_m|Z=i,\phi^0)p(Z=i|\phi^0)}{\sum_{j}p(\mathcal{D}_m|Z=j,\phi^0)p(Z=j|\phi^0)} = \frac{\lambda_i^0p(y_m|x_m, Z=i,\phi^0) }{\sum_{j}\lambda_j^0p(y_m|x_m, Z=j,\phi^0) } \\
        &= \frac{\lambda_i^0p(y_m|x_m, \theta_i^0, \Sigma_i^0) }{\sum_{j}\lambda_j^0p(y_m|x_m,\theta_j^0,\Sigma_j^0)} =  \frac{\lambda_i^0 \int p(y_m|x_m,\theta)\mathcal{N}(\theta;\theta_i^0, \Sigma_i^0) d\theta  }{\sum_{j}\lambda_j^0\int p(y_m|x_m,\theta)\mathcal{N}(\theta;\theta_j^0, \Sigma_j^0)d\theta}
    \end{split}
\end{equation}
and 
\begin{equation}
    \begin{split}
        p(y_m|x_m, \theta_i,\Sigma_i) = \int p(y_m|x_m,\theta)\mathcal{N}(\theta;\theta_i, \Sigma_i)d\theta
    \end{split}
\end{equation}
which can be approximated either by MC sampling or by the probit approximation in Eq.~\eqref{probit_approximation}.
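
For example (hypothetical numbers), with $N=2$, $\lambda^0=(0.6,0.4)$, and component predictive probabilities $p(y_m|x_m,\theta_1^0,\Sigma_1^0)=0.5$ and $p(y_m|x_m,\theta_2^0,\Sigma_2^0)=0.25$, the membership weights are
\[
    p(Z=1|\mathcal{D}_m,\phi^0)=\frac{0.6\times 0.5}{0.6\times 0.5+0.4\times 0.25}=0.75, \qquad p(Z=2|\mathcal{D}_m,\phi^0)=0.25.
\]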

M-step: obtain the parameters $\phi$ by maximizing $Q(\phi|\phi^0,\mathcal{D})$, which involves optimizing $\{\lambda_i\}_{i=1}^N$, $\{\theta_i\}_{i=1}^N$, and $\{\Sigma_i\}_{i=1}^N$.

\paragraph{M-step for $\{\lambda_i\}_{i=1}^N$} Subject to the constraint $\sum_{i=1}^N \lambda_i =1$, we add a Lagrange multiplier $\alpha$ to $Q(\phi|\phi^0,\mathcal{D})$ to solve the constrained problem.
\begin{equation}
    \hat{Q}(\phi|\phi^0,\mathcal{D}) = \sum_{m=1}^M \sum_{i=1}^N p(Z=i|\mathcal{D}_m,\phi^0) [\log p(y_m|x_m, \theta_i,\Sigma_i)+\log \lambda_i] -\alpha\left(\sum_{i=1}^N \lambda_i -1\right)
\end{equation}
To learn $\{\lambda_i\}_{i=1}^N$, we set the gradients of $\hat{Q}(\phi|\phi^0,\mathcal{D})$ with respect to $\{\lambda_i\}_{i=1}^N$ and $\alpha$ to zero, as shown in Eq.~\eqref{eq:optimizing_lambda}.
\begin{equation} \label{eq:optimizing_lambda}
    \begin{split}
        & \frac{\partial \hat{Q}}{\partial \lambda_i} = \sum_{m=1}^M \frac{p(Z=i|\mathcal{D}_m,\phi^0)}{\lambda_i} - \alpha =0  \quad \quad i=1,2,...,N  \\
        & \frac{\partial \hat{Q}}{\partial \alpha} = -\left(\sum_{i=1}^N \lambda_i -1\right) = 0
    \end{split}
\end{equation}
Solving Eq.~\eqref{eq:optimizing_lambda} gives
\begin{equation}
    \lambda_i^* = \frac{\sum_{m=1}^M p(Z=i|\mathcal{D}_m,\phi^0)}{\sum_{m=1}^M \sum_{j=1}^N p(Z=j|\mathcal{D}_m,\phi^0)}
\end{equation}
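To spell out the intermediate step: the first line of Eq.~\eqref{eq:optimizing_lambda} gives $\lambda_i=\frac{1}{\alpha}\sum_{m=1}^M p(Z=i|\mathcal{D}_m,\phi^0)$, and substituting this into the constraint $\sum_{i=1}^N\lambda_i=1$ yields
\[
    \alpha=\sum_{m=1}^M\sum_{j=1}^N p(Z=j|\mathcal{D}_m,\phi^0)=M,
\]
because the membership weights of each data pair sum to one. Hence $\lambda_i^*$ is simply the average membership weight of the $i$th component over the $M$ training pairs.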

\paragraph{M-step for $\{\theta_i\}_{i=1}^N$} Based on Eq.~\eqref{detailed_e_step}, maximizing $Q(\phi|\phi^0,\mathcal{D})$ with respect to $\{\theta_i\}_{i=1}^N$ is equivalent to maximizing each $Q(\theta_i|\phi^0, \mathcal{D})$ independently, where $Q(\theta_i|\phi^0, \mathcal{D})$ is shown in Eq.~\eqref{eq:maximize_theta_i}.
\begin{equation}\label{eq:maximize_theta_i}
    Q(\theta_i|\phi^0, \mathcal{D}) = \sum_{m=1}^M  p(Z=i|\mathcal{D}_m,\phi^0) \log p(y_m|x_m, \theta_i,\Sigma_i)
\end{equation}
where $p(Z=i|\mathcal{D}_m,\phi^0)$ is the membership weight of the data pair $(x_m,y_m)$ for the $i$th ensemble component $\mathcal{N}(\theta_i,\Sigma_i)$. Since $\Sigma_i$ can always be computed by the Laplacian approximation as a post-processing step in our framework, we only need to optimize $\theta_i$ deterministically, with the objective shown in Eq.~\eqref{eq:maximize_theta_i_nosigma}. 
\begin{equation}\label{eq:maximize_theta_i_nosigma}
    \hat{Q}(\theta_i|\phi^0, \mathcal{D}) = \sum_{m=1}^M  p(Z=i|\mathcal{D}_m,\phi^0) \log p(y_m|x_m, \theta_i)
\end{equation}
where $p(y_m|x_m, \theta_i)$ is the softmax probability produced directly by the $i$th ensemble component. Due to the uncertainty-guided ensemble training strategy, different ensemble models focus on different samples, leading to different $p(Z=i|\mathcal{D}_m,\phi^0)$, $i=1,2,...,N$. Directly optimizing $\hat{Q}(\theta_i|\phi^0, \mathcal{D})$ strengthens each model on the samples it focuses on, which implicitly enhances the diversity. To further improve the diversity, we assign each data sample to its top-$l$ components ranked by $p(Z=i|\mathcal{D}_m,\phi^0)$. For example, assume $p(Z=1|\mathcal{D}_m,\phi^0) > p(Z=2|\mathcal{D}_m,\phi^0) >\cdots>p(Z=N|\mathcal{D}_m,\phi^0)$ and $l=2$; then $(x_m,y_m)$ is assigned to the first and second ensemble components. We then fine-tune each ensemble model, placing a higher weight on the data samples it receives, by performing stochastic gradient ascent on the objective in Eq.~\eqref{optimize_theta_i}:
\begin{equation}\label{optimize_theta_i}
    \theta_i^* = \arg \max_{\theta_i} \sum_{m=1}^M  \mathrm{softmax}(I_{l} [p(Z=i|\mathcal{D}_m,\phi^0)]) \log p(y_m|x_m, \theta_i)
\end{equation}
where $I_{l}$ is the indicator function, which returns 1 if $p(Z=i|\mathcal{D}_m,\phi^0)$ is among the $l$ largest values in $\{p(Z=j|\mathcal{D}_m,\phi^0)\}_{j=1}^N$ and returns 0 otherwise. The softmax function is applied over each batch of data to ensure that the weights sum to 1, which is similar to Eq.~(12) of the main body of the paper.
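
As an illustration of these weights, consider a sketch under the assumption that the softmax in Eq.~\eqref{optimize_theta_i} is taken across the $B$ samples of a mini-batch. The symbols $B$, $k$, $w_m$, and the shorthand $I_{l,m}\in\{0,1\}$ for the indicator value of sample $m$ are introduced here only for this example. If $k$ of the $B$ samples are assigned to component $i$, the weight of sample $m$ is
\begin{equation*}
    w_m = \frac{\exp(I_{l,m})}{\sum_{m'=1}^{B} \exp(I_{l,m'})} =
    \begin{cases}
        \dfrac{e}{ke + (B-k)}, & \text{if $(x_m,y_m)$ is assigned to component $i$,}\\[2ex]
        \dfrac{1}{ke + (B-k)}, & \text{otherwise.}
    \end{cases}
\end{equation*}
For instance, with $B=4$ and $k=2$, each assigned sample receives weight $e/(2e+2)\approx 0.366$ and each remaining sample receives $1/(2e+2)\approx 0.134$, so the weights sum to 1 while assigned samples count roughly $e\approx 2.7$ times more than unassigned ones.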

\paragraph{M-step for $\{\Sigma_i\}_{i=1}^N$} Once we have $\theta_i^*$, we perform the LA to get $\Sigma_i^*$.



\section{Experiment Settings and Implementation} 
\subsection{Model Architecture and Hyperparameters}\label{implement}
For the MNIST dataset, we use the architecture: Conv2D-ReLU-Conv2D-ReLU-MaxPool2D-Dense-ReLU-Dropout-Dense-Softmax. Each convolutional layer contains 32 convolution filters with $4\times4$ kernel size. We use a max-pooling layer with a $2\times2$ kernel, a dense layer with 128 units, and a dropout probability of 0.5. For the CIFAR-10 dataset, we use ResNet18. We use the SGD optimizer with an initial learning rate of $0.1$ and momentum of $0.9$ for both MNIST and CIFAR-10. For CIFAR-10, we decrease the learning rate to 0.01, 0.001, and 0.0001 at the 30th, 60th, and 90th epochs, respectively, while the learning rate is kept constant for MNIST. For MNIST, the batch size is set to 128 and the maximum number of epochs is 30. For CIFAR-10, the batch size is 128 and the maximum number of epochs is 120. We apply standard data augmentation to the CIFAR-10 dataset, including random cropping and random horizontal flipping. For constructing the probabilistic ensemble, we use the last-layer LA implemented by \cite{daxberger2021laplace}, which can be found at \url{https://github.com/AlexImmer/Laplace}. We generate 200 samples from the mixture of Gaussians for uncertainty quantification. For the uncertainty-guided ensemble learning strategy, we use $a=0.05, b=1$ as hyperparameters. We use uniform coefficients during AUEL when constructing the PE model that estimates the uncertainty used to guide the training of the next model. For the MoG refinement, we choose $l=2$ for the PE model with 5 components. All the ensemble models have size 5. Each experiment is conducted over 3 independent runs, and the standard deviations are also reported. We run the experiments on an RTX2080Ti GPU, and the proposed method is implemented in PyTorch.
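
The learning-rate schedule for CIFAR-10 described above can be written as a piecewise-constant function of the epoch index $e$. This is only a sketch: the symbol $\eta(e)$ and the convention that each decay takes effect from the stated epoch onward are assumptions made for illustration.
\begin{equation*}
    \eta(e) =
    \begin{cases}
        0.1, & e < 30,\\
        0.01, & 30 \le e < 60,\\
        0.001, & 60 \le e < 90,\\
        0.0001, & 90 \le e \le 120.
    \end{cases}
\end{equation*}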
\begin{figure}[ht]
    \centering
    \includegraphics[width=0.7\linewidth]{PE_shift_mnist_additional.pdf}
    \caption{Additional uncertainty calibration metrics for the rotated MNIST dataset.}
    \label{fig:mnist_shift_additional}
\end{figure}
\subsection{Implementation Details}
In this section, we discuss the implementation details of the different uncertainty estimation methods. We use the default hyperparameters of their open-source code, except for the hyperparameters specified in Appendix \ref{implement}.
\begin{itemize}
    \item ESB: we train ensemble models with random initialization following the experiment settings in Appendix \ref{implement}.
    \item Batch-E: the open-source code can be found at \url{https://github.com/giannifranchi/LP_BNN}.
    \item Hyper-E: we train the ensemble models by varying both the initialization and the weight decay coefficients following the implementation in \url{https://github.com/google/uncertainty-baselines/blob/main/baselines/notebooks/Hyperparameter_Ensembles.ipynb}.
    \item Bayes-E: we follow the open-source code in \url{https://github.com/TeaPearce/Bayesian_NN_Ensembles}.
    \item LPBNN: the open-source code can be found at \url{https://github.com/giannifranchi/LP_BNN}.
    \item LA: we use the last-layer LA with full Hessian matrix computation as discussed in Appendix A. We use the existing software proposed by \cite{daxberger2021laplace}, which can be found at \url{https://github.com/AlexImmer/Laplace}.
    \item Multi-SWAG: we utilize the implementation provided by \url{https://github.com/izmailovpavel/understandingbdl}.
    \item Diversified-E: we train all the ensemble models simultaneously with a regularization term for the diversity. We extend Eq.~(4) in \citep{zhang2020diversified} to multi-class classification.
    \item MCT: we implement Algorithm 1 shown in Appendix C of \cite{lee2015m}.
\end{itemize}
The model architecture, training strategy, and data transformation are the same for all baselines. The method-specific hyperparameters of the ensemble baselines are chosen following their open-source code.

\section{Uncertainty Calibration Under Distributional Shifts}
\subsection{Within-dataset Performance}
The within-dataset performance for different uncertainty quantification methods can be found in Table \ref{tab:within_dataset}. One key observation is the superior performance of our proposed method on the CIFAR-10 dataset across multiple metrics. Notably, it achieves a 15\% improvement in NLL and a 53\% improvement in ECE. On the MNIST dataset, our method is comparable to other ensemble-based approaches. Our method is particularly strong on complex datasets where there is a significant distributional shift between training and testing data, where it often yields substantial performance gains. Conversely, on simpler datasets with minor within-dataset shifts, our method's performance is on par with other techniques.
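
The relative improvements quoted above are consistent with taking ESB as the reference in Table \ref{tab:within_dataset}; this choice of reference is a reading of the table and is stated here as an assumption. For NLL and ECE on CIFAR-10, respectively,
\begin{equation*}
    \frac{1.70-1.44}{1.70} \approx 15.3\%, \qquad \frac{0.75-0.35}{0.75} \approx 53.3\%.
\end{equation*}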

\begin{table*}[ht]
% \fontsize{8.5}{9}\selectfont
	\caption{Within-dataset performance for ACC(\%), NLL($\times 10^{-1}$), ECE($\times 10^{-2}$), BS ($\times 10^{-3}$) on MNIST and CIFAR-10. Each experiment result is aggregated over 3 independent runs.}
	\label{tab:within_dataset}
	\centering
\begin{tabular}{|l|cccc|cccc|}

\hline
	\multirow{2}{*}{Method} & \multicolumn{4}{|c|}{MNIST} &\multicolumn{4}{|c|}{CIFAR-10}  \\ \cline{2-9}
	&ACC &NLL &ECE &BS  &ACC &NLL &ECE &BS  \\
\hline  
\multirow{1}{*}{Ours}  &\textbf{99.43}  &0.211  &0.66 &\textbf{1.0}  &\textbf{95.28}  &\textbf{1.44}  & \textbf{0.35}  & \textbf{7.0} \\
\multirow{1}{*}{ESB}  &99.41  &0.196  &0.46 &\textbf{1.0} &94.63   &1.70 & 0.75 &7.9 \\
\multirow{1}{*}{Batch-E}   &98.92 &0.352  &\textbf{0.25} &1.7 &92.66 &2.49 & 3.04 &11.3\\
\multirow{1}{*}{Hyper-E}  &99.39  &\textbf{0.190}  &0.32  &\textbf{1.0}  &95.13  &1.49  & 0.53  &7.2  \\
\multirow{1}{*}{Bayes-E}  &99.28 &0.236  &0.40  &1.1  &93.94  &1.90 & 0.93 &8.9 \\
\multirow{1}{*}{LPBNN}  &98.91 &0.345  &0.35  &1.7  &93.43  &2.30 & 2.95  &10.3\\
\multirow{1}{*}{LA}   &99.13  &0.274  &0.28  &1.3 &93.31  &2.21 & 2.03  &10.3 \\
\multirow{1}{*}{Multi-SWAG} &99.33  &0.234 &0.48 &1.1 &93.71 &1.76 & 0.54 &8.8 \\
\hline
\end{tabular}\\
\end{table*}

\subsection{Additional Results on Different Calibration Metrics}
In this section, we present supplementary results on additional uncertainty calibration metrics. Specifically, we report AUROC, AUPR, MCE, and ACC for both the rotated MNIST and corrupted CIFAR-10 datasets. Figures \ref{fig:mnist_shift_additional} and \ref{fig:cifar_shift_additional} compare our proposed method with the competing techniques on these metrics. The figures show that our method outperforms the baselines across a diverse set of uncertainty calibration metrics.

\begin{figure}[ht]
    \centering
    \includegraphics[width=0.7\linewidth]{PE_shift_cifar_additional.pdf}
    \caption{Additional uncertainty calibration metrics for corrupted CIFAR-10 dataset.}
    \label{fig:cifar_shift_additional}
\end{figure}
\clearpage
\section{Visualizations of Diversity Analysis}

In this section, we visualize the diversity in both parameter space and prediction space. Specifically, we use Principal Component Analysis (PCA) to project the neural network parameters and the predictive logits for the MNIST testing data onto a two-dimensional plane.
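
The projection can be summarized as follows. This is the standard PCA construction rather than a description of the exact code used, and the symbols $X$, $\bar{x}$, $V$, and $Z$ are introduced here for illustration. Stack the $n$ vectors to be visualized (flattened parameters or logits), each of dimension $d$, as the rows of $X\in\mathbb{R}^{n\times d}$ with row mean $\bar{x}$. With the singular value decomposition of the centered matrix,
\begin{equation*}
    X - \mathbf{1}\bar{x}^\top = U S V^\top, \qquad Z = (X - \mathbf{1}\bar{x}^\top)\,[v_1, v_2] \in \mathbb{R}^{n\times 2},
\end{equation*}
where $v_1$ and $v_2$ are the first two columns of $V$ (the directions of largest variance) and the rows of $Z$ are the plotted two-dimensional coordinates.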

We use Ensemble (ESB), Hyper Ensemble (Hyper-E), and Bayesian Ensemble (Bayes-E) as baseline methods for comparison. The visualizations, presented pairwise in Figure \ref{fig:diversity_mnist}, offer empirical evidence that our proposed technique enhances diversity via a probabilistic ensemble combined with uncertainty-guided ensemble learning. While Bayes-E and Hyper-E can also increase diversity, our proposed method yields a more pronounced improvement.

\begin{figure}[ht]
    \centering
    \includegraphics[width=0.82\linewidth]{PE_diversity_mnist_all.pdf}
    \caption{Visualizations of the prediction space (first row) and parameter space (second row) diversity. The comparison is conducted pairwise between our proposed method and the other ensemble-based methods.}
    \label{fig:diversity_mnist}
\end{figure}
\clearpage
\section{Ablation Studies and Further Analysis}
\subsection{Effectiveness of Sub-modules}
In this section, we provide supplementary experimental results on OOD detection and uncertainty calibration to further demonstrate the effectiveness of each sub-module; they are shown in Table \ref{tab:ood_result_ablation}, Figure \ref{fig:ablation_mnist}, and Figure \ref{fig:ablation_cifar}. The AUEL module improves performance more on the CIFAR-10 dataset than on the MNIST dataset. This is mainly because MNIST is simpler, so that all training samples achieve small uncertainties with little difference among them. The PE module improves performance on both datasets across various metrics, especially AUROC, AUPR, ECE, and NLL. The refinement of MoG parameters works better on the MNIST dataset and shows only marginal improvements on the CIFAR-10 dataset. This is reasonable since the MoG refinement can further improve the expertise of each ensemble component for MNIST, whereas for CIFAR-10 the AUEL already enhances the specialty of each ensemble component, making the refinement less necessary. Hence, the MoG refinement performs better on simpler datasets where the within-dataset uncertainties are all small and similar.

\begin{table*}[ht] 
	\caption{Effectiveness of sub-modules: additional OOD detection results for AUROC (\%) and AUPR (\%) on MNIST-related and C10-related datasets with epistemic uncertainty (EU).}
	\label{tab:ood_result_ablation}
	\centering
\begin{tabular}{|l|cc|cc|}
\hline
	\multirow{2}{*}{Method} & \multicolumn{2}{|c|}{MNIST $\rightarrow$ EMNIST} &\multicolumn{2}{|c|}{MNIST $\rightarrow$ KMNIST} \\ \cline{2-5}
	&AUROC&AUPR &AUROC &AUPR\\
\hline  
\multirow{1}{*}{ESB}  & $97.32 \pm \scalebox{0.85}{0.14}$ & $96.10 \pm \scalebox{0.85}{0.46}$ & $97.92\pm \scalebox{0.85}{0.10}$ & $97.13\pm \scalebox{0.85}{0.27}$ \\
\multirow{1}{*}{AUEL}  & $97.55 \pm \scalebox{0.85}{0.19}$ & $96.32 \pm \scalebox{0.85}{0.40}$ & $97.97\pm \scalebox{0.85}{0.19}$ & $97.23\pm \scalebox{0.85}{0.14}$\\
\multirow{1}{*}{AUEL + PE}  & $98.01 \pm \scalebox{0.85}{0.07}$ & $97.26 \pm \scalebox{0.85}{0.08}$ & $98.39\pm \scalebox{0.85}{0.11}$ & $97.98\pm \scalebox{0.85}{0.08}$ \\
\multirow{1}{*}{AUEL+RPE}  &  $\textbf{98.42} \pm \scalebox{0.85}{0.03}$ & $\textbf{98.22}\pm \scalebox{0.85}{0.02}$ &$\textbf{98.90} \pm \scalebox{0.85}{0.04}$ &$\textbf{98.77}\pm \scalebox{0.85}{0.06}$ \\
\hline
\end{tabular}

\begin{tabular}{|l|cc|cc|}
\hline
	\multirow{2}{*}{Method} & \multicolumn{2}{|c|}{C10 $\rightarrow$ LSUN} &\multicolumn{2}{|c|}{C10 $\rightarrow$ C100} \\ \cline{2-5}
	&AUROC&AUPR &AUROC &AUPR\\
\hline  
\multirow{1}{*}{ESB}  & $88.42 \pm \scalebox{0.85}{0.85}$ & $84.99 \pm \scalebox{0.85}{0.65}$ & $91.87\pm \scalebox{0.85}{0.58}$ & $88.69\pm \scalebox{0.85}{0.55}$\\
\multirow{1}{*}{AUEL}  & $89.16 \pm \scalebox{0.85}{0.12}$ & $85.55 \pm \scalebox{0.85}{0.18}$ & $92.73\pm \scalebox{0.85}{0.16}$ & $89.71\pm \scalebox{0.85}{0.57}$ \\
\multirow{1}{*}{AUEL + PE}  & $89.57 \pm \scalebox{0.85}{0.08}$ & $86.81 \pm \scalebox{0.85}{0.14}$ & $93.80\pm \scalebox{0.85}{0.11}$ & $91.67\pm \scalebox{0.85}{0.36}$ \\
\multirow{1}{*}{AUEL+RPE} & $\textbf{89.58} \pm \scalebox{0.85}{0.11}$ & $\textbf{86.86} \pm \scalebox{0.85}{0.18}$ & $\textbf{93.93}\pm \scalebox{0.85}{0.13}$ & $\textbf{91.93}\pm \scalebox{0.85}{0.39}$ \\
\hline
\end{tabular}

\end{table*}
\begin{figure}[h]
    \centering
    \includegraphics[width=0.75\linewidth]{ablation_mnist.pdf}
    \caption{Effectiveness of sub-modules: additional uncertainty calibration results for rotated MNIST dataset with various metrics such as ECE, Brier Score, NLL, ACC, MCE, and AUROC.}
    \label{fig:ablation_mnist}
\end{figure}

\begin{figure}[h]
    \centering
    \includegraphics[width=0.73\linewidth]{ablation_cifar.pdf}
    \caption{Effectiveness of sub-modules: additional uncertainty calibration results for noisy CIFAR-10 dataset with various metrics such as ECE, Brier Score, NLL, ACC, MCE, and AUROC.}
    \label{fig:ablation_cifar}
\end{figure}

\subsection{Probabilistic Ensemble as a Plug-and-Play Module} 
In this section, we treat the probabilistic ensemble as a plug-and-play module and add it to Bayes-E and Hyper-E to show further improvements in both OOD detection and uncertainty calibration. Additional experimental results are shown in Table \ref{tab:ood_plug} for OOD detection and in Figures \ref{fig:plug_mnist} and \ref{fig:plug_cifar} for uncertainty calibration under distributional shifts.
\begin{figure}[ht]
    \centering
    \includegraphics[width=0.74\linewidth]{ablation_plug_and_play_mnist.pdf}
    \caption{Probabilistic ensemble as a plug-and-play module: additional uncertainty calibration results for rotated MNIST dataset with various metrics such as ECE, Brier Score, NLL, ACC, MCE, and AUROC.}
    \label{fig:plug_mnist}
\end{figure}

\clearpage

\begin{figure}[ht]
    \centering
    \includegraphics[width=0.75\linewidth]{ablation_plug_and_play_cifar.pdf}
    \caption{Probabilistic ensemble as a plug-and-play module: additional uncertainty calibration results for noisy CIFAR-10 dataset with various metrics such as ECE, Brier Score, NLL, ACC, MCE, and AUROC.}
    \label{fig:plug_cifar}
\end{figure}

\begin{table*}[ht] 
	\caption{Probabilistic ensemble as a plug-and-play module: additional OOD detection results for AUROC (\%) and AUPR (\%) on MNIST-related and C10-related datasets with epistemic uncertainty (EU).}
	\label{tab:ood_plug}
	\centering
\begin{tabular}{|l|cc|cc|}
\hline
	\multirow{2}{*}{Method} & \multicolumn{2}{|c|}{MNIST $\rightarrow$ EMNIST} &\multicolumn{2}{|c|}{MNIST $\rightarrow$ KMNIST} \\ \cline{2-5}
	&AUROC&AUPR &AUROC &AUPR\\
\hline  
\multirow{1}{*}{Bayes-E} &$97.07 \pm \scalebox{0.85}{0.29}$ & $95.86 \pm \scalebox{0.85}{0.33}$ & $97.73\pm \scalebox{0.85}{0.06}$ & $96.72\pm \scalebox{0.85}{0.14}$ \\
\multirow{1}{*}{Bayes-E + PE} & $\textbf{97.82} \pm \scalebox{0.85}{0.08}$ & $\textbf{97.16} \pm \scalebox{0.85}{0.10}$ & $\textbf{98.28}\pm \scalebox{0.85}{0.02}$ & $\textbf{97.75}\pm \scalebox{0.85}{0.00}$ \\
\multirow{1}{*}{Hyper-E}  & $97.56 \pm \scalebox{0.85}{0.31}$ & $96.68 \pm \scalebox{0.85}{0.51}$ & $97.92\pm \scalebox{0.85}{0.43}$ & $97.32\pm \scalebox{0.85}{0.53}$ \\
\multirow{1}{*}{Hyper-E + PE}  & $\textbf{98.10} \pm \scalebox{0.85}{0.10}$ & $\textbf{97.59} \pm \scalebox{0.85}{0.18}$ & $\textbf{98.46}\pm \scalebox{0.85}{0.14}$ & $\textbf{98.14}\pm \scalebox{0.85}{0.13}$\\
\hline
\end{tabular}

\begin{tabular}{|l|cc|cc|}
\hline
	\multirow{2}{*}{Method} & \multicolumn{2}{|c|}{C10 $\rightarrow$ LSUN} &\multicolumn{2}{|c|}{C10 $\rightarrow$ C100} \\ \cline{2-5}
	&AUROC&AUPR &AUROC &AUPR\\
\hline  
\multirow{1}{*}{Bayes-E}  & $87.85 \pm \scalebox{0.85}{1.22}$ & $84.56 \pm \scalebox{0.85}{1.01}$ & $91.80\pm \scalebox{0.85}{0.45}$ & $88.83\pm \scalebox{0.85}{0.02}$ \\
\multirow{1}{*}{Bayes-E + PE}  & $\textbf{89.17} \pm \scalebox{0.85}{0.22}$ & $\textbf{87.13} \pm \scalebox{0.85}{0.40}$ & $\textbf{94.07}\pm \scalebox{0.85}{0.64}$ & $\textbf{92.50}\pm \scalebox{0.85}{1.42}$  \\
\multirow{1}{*}{Hyper-E}  & $88.82 \pm \scalebox{0.85}{0.15}$ & $85.29 \pm \scalebox{0.85}{0.25}$ & $92.59\pm \scalebox{0.85}{0.24}$ & $89.65\pm \scalebox{0.85}{0.71}$ \\
\multirow{1}{*}{Hyper-E + PE}  & $\textbf{89.25} \pm \scalebox{0.85}{0.14}$ & $\textbf{86.46} \pm \scalebox{0.85}{0.42}$ & $\textbf{93.63}\pm \scalebox{0.85}{0.37}$ & $\textbf{91.52}\pm \scalebox{0.85}{0.86}$\\
\hline
\end{tabular}

\end{table*}

\subsection{Efficiency Analysis} \label{appendix_efficiency}

Let $T$ be the cost of training a deterministic model, $N$ be the ensemble size, $M$ be the total number of parameters, $P$ be the number of last-layer parameters, $C$ be the number of classes, and $S$ be the number of samples generated for LA and SWAG ($S=200$ in our experiments). Table \ref{tab:complexity} presents the theoretical and empirical complexities for both training and inference. Training complexity represents the cost of training a single ensemble component, while inference complexity indicates the cost of UQ for a single data sample. Empirical training and inference runtimes are measured on the C10 dataset; we report the average one-epoch training runtime and the UQ runtime for the C10 test set.

\begin{table*}[t]
    \centering
    \caption{Theoretical complexity and empirical runtimes for training and inference (UQ) of all baselines on the C10 dataset.}
    \label{tab:complexity}
\begin{tabular}{|l|cc|cc|}
\hline
	\multirow{2}{*}{Method} & \multicolumn{2}{|c|}{Theoretical Complexity} &\multicolumn{2}{|c|}{Empirical Runtime } \\ \cline{2-5}
	&Training & UQ &Training & UQ  \\
\hline  
\multirow{1}{*}{Ours} & $O(T+M+C^3+P^3)$ & $O(NM+SP)$ & 35s & 15.4s  \\
\multirow{1}{*}{ESB} & $O(T)$ & $O(NM)$ & 34s & 5.3s \\
\multirow{1}{*}{Hyper-E} & $>O(T)$ & $O(NM)$ & 35s & 5.7s \\
\multirow{1}{*}{Bayes-E} & $O(T)$ & $O(NM)$ & 35s & 5.6s \\
\multirow{1}{*}{Batch-E} & $<O(T)$ & $<O(NM)$ & 26s & 4.3s \\
\multirow{1}{*}{LPBNN} & $<O(T)$ & $<O(NM)$ & 28s & 4.4s \\
\multirow{1}{*}{LA} & $O(T+M+C^3+P^3)$ & $O(M+SP)$ & 35s & 7.6s  \\
\multirow{1}{*}{Multi-SWAG} & $O(T+M^2)$ & $O(NSM)$ & 36s & 190.6s  \\
\multirow{1}{*}{Diversified-E} & $O(T)$ & $O(NM)$ & 34s & 5.3s \\
\multirow{1}{*}{MCT} & $O(T)$ & $O(NM)$ & 34s & 5.3s \\
\hline
\end{tabular}
\end{table*}

Compared to ESB, our method has the additional cost of constructing $N$ LAs during training, with each taking 
$O(M+C^3+P^3)$. Constructing a single network LA takes about 15s for C10, which is negligible compared to total training time. Inference complexity involves generating samples from a Gaussian mixture, with an additional cost of $O(SP)$ compared to ESB. If we generate 200 samples, the uncertainty estimation runtime for 10000 testing images of MNIST/C10 is 3.8s/15.4s for our method. It takes about 0.03s to obtain one more
sample for uncertainty quantification. ESB, Bayes-E, and MCT share the same training/inference complexities. However, Hyper-E usually takes a longer time for training an ensemble model pool and applying a greedy search strategy to select optimal ensemble models. Batch-E and LPBNN are more efficient, thanks to weight-sharing. The training and inference complexities depend on the number of shared weights. Multi-SWAG is more computationally expensive since it requires constructing a Gaussian posterior approximation with a low-rank covariance matrix during training. Diversified-E has similar training and inference complexity compared to ESB but requires a larger memory for computing the pairwise distance among models.
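
To give a sense of scale for the $O(P^3)$ term, consider a rough count under the assumption that the last layer is a fully connected layer with bias, acting on the 512-dimensional feature vector of a standard ResNet18 with $C=10$ outputs:
\begin{equation*}
    P = 512 \times 10 + 10 = 5130, \qquad P^2 = 5130^2 \approx 2.6\times 10^{7}.
\end{equation*}
Under this assumption, the full last-layer Hessian has about 26 million entries (roughly 105\,MB in single precision), whereas a full Hessian over all parameters would have $M^2$ entries.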
 
Moreover, we also provide the OOD detection results of the ensemble-based methods for different ensemble sizes on MNIST-related and CIFAR-related datasets. The baseline methods include ESB, Hyper-E, and Bayes-E. The results are shown in Figure \ref{fig:efficiency_mnist} and Figure \ref{fig:efficiency_cifar}. With limited computational resources, a probabilistic ensemble of small size suffices to achieve performance competitive with other ensemble-based methods of larger size. In some cases, a PE model with 2 components even outperforms other ensemble-based methods with 10 components, e.g., C10 $\rightarrow$ SVHN shown in Figure \ref{fig:efficiency_cifar} (a).


\begin{figure}[ht]
    \centering
    \includegraphics[width=0.8\linewidth]{ablation_efficiency_mnist.pdf}
    \caption{Efficiency of Probabilistic Ensemble: OOD detection results for MNIST-related datasets with metrics AUROC and AUPR.}
    \label{fig:efficiency_mnist}
\end{figure}
\clearpage
\begin{figure}[ht]
    \centering
    \includegraphics[width=0.8\linewidth]{ablation_efficiency_cifar.pdf}
    \caption{Efficiency of Probabilistic Ensemble: OOD detection results for CIFAR10-related datasets with metrics AUROC and AUPR.}
    \label{fig:efficiency_cifar}
\end{figure}

\section{Application to Larger Datasets}
The proposed method can be scaled up to larger datasets and larger models. In this section, we apply our proposed method to CIFAR-100 (C100) and TinyImagenet (TIM) with empirical results. 

We use ResNet152 as the backbone for both datasets. The training hyperparameters are as follows. For CIFAR-100, the maximum number of epochs is 120 and the batch size is 128. We use an SGD optimizer with an initial learning rate of 0.1 and momentum of 0.9. During training, the learning rate decreases to 0.01, 0.001, and 0.0001 at the 30th, 60th, and 90th epochs, respectively. For TinyImagenet, the maximum number of epochs is 80 with batch size 128. We use the same optimizer as for CIFAR-100, with learning rate decay at the 20th, 40th, and 60th epochs. Standard data augmentation, including random cropping and random horizontal flipping, is applied to both datasets. We randomly select 10\% of the training data as validation data for CIFAR-100, while the validation data of TinyImagenet is provided. To construct the probabilistic ensemble model after training, we follow the same experiment settings shown in Appendix \ref{implement}.

For evaluation, we report the OOD detection results and the uncertainty calibration performance under distributional shifts. For CIFAR-100 and TinyImagenet, we use LSUN and CIFAR-10 as the OOD datasets. We use the same evaluation metrics described in Sections 4.1 and 4.2 of the main body of the paper. We compare the proposed AUEL+PE with ESB, Hyper-E, and Bayes-E. The OOD detection experiments are conducted on the CIFAR-100 and TinyImagenet testing datasets. The uncertainty calibration evaluation for TinyImagenet is conducted on its validation dataset, since the TIM testing data does not provide labels. To create the corrupted C100 and corrupted TIM datasets, we add zero-mean Gaussian noise with variance ranging from 0 to 0.25 in steps of 0.05 to the original datasets, following Sec. 4.2 of the main body of the paper.
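
In symbols, the corruption described above can be sketched as follows, where $\tilde{x}$, $n$, and $\sigma^2$ are notation introduced here; whether pixel values are clipped after adding the noise is not specified and is left open:
\begin{equation*}
    \tilde{x} = x + n, \qquad n \sim \mathcal{N}(0, \sigma^2 I), \qquad \sigma^2 \in \{0, 0.05, 0.10, 0.15, 0.20, 0.25\},
\end{equation*}
giving six shift levels, with $\sigma^2=0$ corresponding to the clean data.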

The experimental results are shown in Table \ref{tab:large_dataset} for OOD detection and in Figures \ref{fig:cifar100_calibration} and \ref{fig:tiny_calibration} for uncertainty calibration. With the enhanced diversity of our proposed method, we outperform the other baselines on both tasks, with one exception in Table \ref{tab:large_dataset}: AUPR on TIM $\rightarrow$ LSUN, where ESB is slightly higher. For uncertainty calibration under distributional shifts, the performance improvement generally increases as the shift becomes more significant, indicating the better generalization ability brought by improved diversity.

\begin{table*}[ht]
% \fontsize{8.5}{9}\selectfont
	\caption{OOD detection results for AUROC (\%) and AUPR (\%) on CIFAR-100 and TinyImagenet with epistemic uncertainty. Each experiment result is aggregated over 3 independent runs. The standard deviation is also reported.}
	\label{tab:large_dataset}
	\centering
\begin{tabular}{|l|cc|cc|}

\hline
	\multirow{2}{*}{Method} &\multicolumn{2}{|c|}{C100 $\rightarrow$ LSUN} & \multicolumn{2}{|c|}{C100$ \rightarrow$ C10} \\ \cline{2-5}
	&AUROC&AUPR &AUROC &AUPR  \\
\hline  
\multirow{1}{*}{AUEL+PE}  & $\textbf{79.06} \pm \scalebox{0.85}{0.45}$ & $\textbf{73.94} \pm \scalebox{0.85}{0.27}$ & $\textbf{86.11}\pm \scalebox{0.85}{0.99}$ & $\textbf{81.44}\pm \scalebox{0.85}{0.18}$ \\
\multirow{1}{*}{ESB}  & $78.42\pm \scalebox{0.85}{0.74}$ & $73.22\pm \scalebox{0.85}{1.20}$ & $82.44 \pm \scalebox{0.85}{0.86}$ & $77.15 \pm \scalebox{0.85}{1.13}$  \\
\multirow{1}{*}{Hyper-E}   & $78.60\pm \scalebox{0.85}{0.95}$ & $73.81\pm \scalebox{0.85}{0.08}$ & $82.97 \pm \scalebox{0.85}{1.81}$ & $78.03 \pm \scalebox{0.85}{1.47}$  \\
\multirow{1}{*}{Bayes-E}  & $78.25\pm \scalebox{0.85}{0.56}$ & $73.06\pm \scalebox{0.85}{1.70}$ & $82.22 \pm \scalebox{0.85}{0.72}$ & $76.98 \pm \scalebox{0.85}{0.84}$  \\
\hline
\end{tabular}\\

\begin{tabular}{|l|cc|cc|}

\hline
	\multirow{2}{*}{Method} &\multicolumn{2}{|c|}{TIM $\rightarrow$ LSUN} & \multicolumn{2}{|c|}{TIM$ \rightarrow$ C10} \\ \cline{2-5}
	&AUROC&AUPR &AUROC &AUPR \\
\hline  
\multirow{1}{*}{AUEL+PE}  & $\textbf{86.89} \pm \scalebox{0.85}{0.24}$ & $83.15 \pm \scalebox{0.85}{0.78}$ & $\textbf{86.06}\pm \scalebox{0.85}{1.03}$ & $\textbf{82.17}\pm \scalebox{0.85}{0.61}$ \\
\multirow{1}{*}{ESB}  &$86.76 \pm  \scalebox{0.85}{1.19}$   & $\textbf{83.93} \pm \scalebox{0.85}{2.33}$ & $84.18 \pm \scalebox{0.85}{1.30}$ & $80.28 \pm \scalebox{0.85}{1.07}$ \\
\multirow{1}{*}{Hyper-E}  &$86.60 \pm  \scalebox{0.85}{0.97}$   & $81.95 \pm \scalebox{0.85}{2.07}$ & $85.70 \pm \scalebox{0.85}{0.76}$ & $82.12 \pm \scalebox{0.85}{1.11}$ \\
\multirow{1}{*}{Bayes-E}  &$86.14 \pm  \scalebox{0.85}{0.77}$   & $81.48 \pm \scalebox{0.85}{2.53}$ & $85.56 \pm \scalebox{0.85}{0.39}$ & $80.65 \pm \scalebox{0.85}{1.08}$ \\
\hline
\end{tabular}

\end{table*}

\begin{figure}[ht]
    \centering
    \includegraphics[width=0.85\linewidth]{PE_shift_cifar100_all.pdf}
    \caption{Uncertainty calibration results for corrupted CIFAR-100 dataset with various metrics such as ECE, Brier Score, NLL, ACC, MCE, and AUROC.}
    \label{fig:cifar100_calibration}
\end{figure}
\clearpage
\begin{figure}[ht]
    \centering
    \includegraphics[width=0.85\linewidth]{PE_shift_tinyimagenet.pdf}
    \caption{Uncertainty calibration results for corrupted TinyImagenet dataset with various metrics such as ECE, Brier Score, NLL, ACC, MCE, and AUROC.}
    \label{fig:tiny_calibration}
\end{figure}

\section{Other Distributional Shifts}
Besides generating shifted distributions by adding Gaussian noise, we further perform different adversarial attacks and evaluate the robustness of our proposed method. We generate an adversarial sample $x_{adv}$ for each C10 testing image $x$ following the fast gradient sign method (FGSM) shown in Eq.~\eqref{FGSM}.
\begin{equation}
    \label{FGSM}
      x_{adv} =   x + \epsilon \text{sign}(\nabla_{  x}L(\theta,  x,  y))
\end{equation}
where $L$ is the NLL loss, $\epsilon$ is a hyperparameter indicating the perturbation level, and $\mathrm{sign}(u)$ is the function that outputs $1$ if $u\geq0$ and $-1$ if $u<0$. We then compute the ACC and NLL of our proposed method on the perturbed images and compare them with various ensemble baselines. In addition, a recently proposed type of attack, called the uncertainty attack, perturbs the input to optimize the uncertainty of the prediction. We replace $L$ in Eq.~\eqref{FGSM} with the entropy of $p(y|x,\theta)$ to perform uncertainty attacks. The results are shown in Table \ref{tab:advaserial}, where ``Single'' refers to a single deterministic network. They indicate the effectiveness of our proposed method against different adversarial attacks. Compared to ESB, our method achieves a significant improvement, especially when $\epsilon$ is small.
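
Written out, the uncertainty attack described above replaces the loss in Eq.~\eqref{FGSM} with the predictive entropy. The following is a sketch of that substitution, where the symbols $H$ and $c$ and the sum over the $C$ classes are notation introduced here, and the sign convention of Eq.~\eqref{FGSM} is kept, so that the step increases the entropy:
\begin{equation*}
    x_{adv} = x + \epsilon\, \mathrm{sign}\big(\nabla_x H(x,\theta)\big), \qquad H(x,\theta) = -\sum_{c=1}^{C} p(y=c|x,\theta) \log p(y=c|x,\theta).
\end{equation*}
Unlike the NLL-based attack, this perturbation does not require the label $y$.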
\begin{table*}[h] 
% \fontsize{8.5}{9}\selectfont
	\caption{The ACC and NLL under different adversarial attacks on C10 dataset.}
	\label{tab:advaserial}
	\centering
\begin{tabular}{|l|cccccc|}
\hline
	\multirow{3}{*}{Method} & \multicolumn{6}{|c|}{Adversarial Attacks with FGSM using NLL Loss}\\ 
    \cline{2-7}
	& \multicolumn{2}{c}{$\epsilon=0.01$} & \multicolumn{2}{c}{$\epsilon=0.02$} & \multicolumn{2}{c|}{$\epsilon=0.05$} \\
% 	\cline{2-7}
 &ACC &NLL &ACC &NLL&ACC &NLL \\
\hline  
\multirow{1}{*}{Ours}  &$0.732 \pm \scalebox{0.85}{0.01}$&$\textbf{0.93} \pm \scalebox{0.85}{0.02}$&$\textbf{0.547} \pm \scalebox{0.85}{0.02}$&$\textbf{1.81} \pm \scalebox{0.85}{0.17}$&$\textbf{0.309} \pm \scalebox{0.85}{0.01}$&$\textbf{3.04} \pm \scalebox{0.85}{0.23}$\\
\multirow{1}{*}{Single}  &$0.534 \pm \scalebox{0.85}{0.13}$&$4.01 \pm \scalebox{0.85}{1.99}$&$0.367 \pm \scalebox{0.85}{0.10}$&$6.57 \pm \scalebox{0.85}{2.30}$&$0.206 \pm \scalebox{0.85}{0.10}$&$7.94 \pm \scalebox{0.85}{2.30}$\\
\multirow{1}{*}{ESB}  &$0.684 \pm \scalebox{0.85}{0.07}$&$1.31 \pm \scalebox{0.85}{0.52}$&$0.500 \pm \scalebox{0.85}{0.07}$&$2.56 \pm \scalebox{0.85}{0.99}$&$0.285 \pm \scalebox{0.85}{0.04}$&$3.98 \pm \scalebox{0.85}{1.16}$\\
\multirow{1}{*}{Hyper-E}  &$\textbf{0.733} \pm \scalebox{0.85}{0.01}$&$0.94 \pm \scalebox{0.85}{0.08}$&$0.546 \pm \scalebox{0.85}{0.02}$&$1.89 \pm \scalebox{0.85}{0.20}$&$\textbf{0.309} \pm \scalebox{0.85}{0.01}$&$3.21 \pm \scalebox{0.85}{0.30}$\\
\multirow{1}{*}{Bayes-E}  &$0.692 \pm \scalebox{0.85}{0.03}$&$1.14 \pm \scalebox{0.85}{0.18}$&$0.507 \pm \scalebox{0.85}{0.03}$&$2.15 \pm \scalebox{0.85}{0.35}$&$0.298 \pm \scalebox{0.85}{0.02}$&$3.46 \pm \scalebox{0.85}{0.57}$\\
\hline
\end{tabular}
\begin{tabular}{|l|cccccc|}
\hline
	\multirow{3}{*}{Method} & \multicolumn{6}{|c|}{Adversarial Attacks with FGSM using Uncertainty Loss}\\ 
    \cline{2-7}
	& \multicolumn{2}{c}{$\epsilon=0.01$} & \multicolumn{2}{c}{$\epsilon=0.02$} & \multicolumn{2}{c|}{$\epsilon=0.05$} \\
% 	\cline{2-7}
 &ACC &NLL &ACC &NLL&ACC &NLL \\
\hline  
\multirow{1}{*}{Ours}  &$\textbf{0.784} \pm \scalebox{0.85}{0.00}$&$\textbf{0.65} \pm \scalebox{0.85}{0.01}$&$0.582 \pm \scalebox{0.85}{0.00}$&$\textbf{1.52} \pm \scalebox{0.85}{0.02}$&$\textbf{0.328} \pm \scalebox{0.85}{0.00}$&$\textbf{2.88} \pm \scalebox{0.85}{0.01}$\\
\multirow{1}{*}{Single}  &$0.569 \pm \scalebox{0.85}{0.14}$&$3.10 \pm \scalebox{0.85}{1.72}$&$0.391 \pm \scalebox{0.85}{0.10}$&$5.61 \pm \scalebox{0.85}{2.08}$&$0.223 \pm \scalebox{0.85}{0.05}$&$7.34 \pm \scalebox{0.85}{1.14}$\\
\multirow{1}{*}{ESB}  &$0.733 \pm \scalebox{0.85}{0.07}$&$0.91 \pm \scalebox{0.85}{0.35}$&$0.535 \pm \scalebox{0.85}{0.07}$&$2.10 \pm \scalebox{0.85}{0.79}$&$0.298 \pm \scalebox{0.85}{0.04}$&$3.72 \pm \scalebox{0.85}{1.03}$\\
\multirow{1}{*}{Hyper-E}  &$0.783 \pm \scalebox{0.85}{0.01}$&$0.66 \pm \scalebox{0.85}{0.04}$&$\textbf{0.584} \pm \scalebox{0.85}{0.02}$&$1.57 \pm \scalebox{0.85}{0.15}$&$0.323 \pm \scalebox{0.85}{0.01}$&$3.03 \pm \scalebox{0.85}{0.26}$\\
\multirow{1}{*}{Bayes-E}  &$0.738 \pm \scalebox{0.85}{0.03}$&$0.82 \pm \scalebox{0.85}{0.11}$&$0.539 \pm \scalebox{0.85}{0.03}$&$1.78 \pm \scalebox{0.85}{0.25}$&$0.309 \pm \scalebox{0.85}{0.02}$&$3.24 \pm \scalebox{0.85}{0.40}$\\
\hline
\end{tabular}
\end{table*}

\begin{figure}[ht]
\centering
     \includegraphics[width=0.9\linewidth]{toy_examples.pdf}
     \caption{Synthetic data experiment. The first and second row represent the experiments for linear and nonlinear datasets, respectively. In the first column, the red dots are the training samples and the gray line is the ground truth mean of $p(y|x)$. In the second column, the ground truth input data density $p(x)$ is plotted as a function of $x$. The third column shows the normalized estimated epistemic uncertainty by PE as a function of input $x$.}
     \label{fig:syn_reg}
\end{figure}

\section{Synthetic Experiments} % (fold)

\label{sub:toy_datasets}
\paragraph{One-dimensional Regression Problems.}
To demonstrate the reliability of PE to quantify epistemic uncertainty, we provide some toy examples for one-dimensional regression problems $y=f(x) + \epsilon$ where $\epsilon$ is the noise term. We use both linear and nonlinear synthetic datasets. We first sample $x$ in a region of $[0, 6]$ from an exponential distribution $p(x) = e^{-x}$.  
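
Restricting $p(x)=e^{-x}$ to $[0,6]$ yields a truncated exponential distribution. As a sketch (the sampling procedure is not specified here, and $u$ is notation introduced for illustration), the normalized density and an inverse-CDF sampler are
\begin{equation*}
    p(x) = \frac{e^{-x}}{1-e^{-6}}, \quad x\in[0,6], \qquad x = -\log\big(1 - u\,(1-e^{-6})\big), \quad u \sim \mathrm{Uniform}(0,1),
\end{equation*}
where $1-e^{-6}\approx 0.9975$, so the truncation removes only about $0.25\%$ of the probability mass. The training density therefore decays by a factor of $e^{6}\approx 403$ from $x=0$ to $x=6$.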

Then, for each $x_i$, the corresponding $y_i$ is sampled from $\mathcal{N}(x_i, 0.5)$ for the linear dataset and from $\mathcal{N}(x_i^3, 0.5)$ for the nonlinear dataset. We obtain 600 pairs $\{x_i,y_i\}$ as training data for each dataset. Given the training data, we build a ReLU network with two fully connected layers of 20 hidden nodes each. The neural network outputs $p(y|x,\theta) = \mathcal{N}(\mu(x,\theta),\sigma^2(x,\theta))$, where $\theta$ denotes the model parameters. In the probabilistic ensemble framework, we construct a 2-component PE model in which the posterior distribution of the parameters $p(\theta|\mathcal{D},\beta)$ is approximated through LA by the equally weighted mixture $\sum_{i=1}^2 0.5\mathcal{N}(\theta;\theta_i,\Sigma_i)$. The epistemic uncertainty at $x$ can be calculated by
$$
\mathrm{Var}_{p(\theta|\mathcal{D},\beta)}\left[\mathrm{E}_{p(y|x,\theta)}[y]\right] \approx \frac{1}{M-1} \sum_{j=1}^{M} \left[\mu(x,\theta^{(j)})-\frac{1}{M}\sum_{k=1}^{M} \mu(x,\theta^{(k)})\right]^2
$$
where $\theta^{(j)}$, $j=1,\dots,M$, are $M$ samples drawn from $\sum_{i=1}^2 0.5\mathcal{N}(\theta;\theta_i,\Sigma_i)$. The results are shown in Figure~\ref{fig:syn_reg}.
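
For concreteness, the Monte Carlo estimate above can be unpacked as follows. This is a sketch under the Gaussian likelihood defined above; the component index $c_j$ is an auxiliary variable introduced only for this sketch. By the law of total variance, the predictive variance splits into two terms,
$$
\mathrm{Var}[y|x,\mathcal{D}] = \mathrm{E}_{p(\theta|\mathcal{D},\beta)}\left[\sigma^2(x,\theta)\right] + \mathrm{Var}_{p(\theta|\mathcal{D},\beta)}\left[\mu(x,\theta)\right],
$$
and the second term is the quantity estimated above, since $\mathrm{E}_{p(y|x,\theta)}[y] = \mu(x,\theta)$. Each parameter sample can be drawn from the equally weighted mixture by ancestral sampling,
$$
c_j \sim \mathrm{Categorical}(0.5, 0.5), \qquad \theta^{(j)} \mid c_j \sim \mathcal{N}(\theta_{c_j}, \Sigma_{c_j}), \qquad j = 1,\dots,M,
$$
after which the sample variance of $\mu(x,\theta^{(1)}),\dots,\mu(x,\theta^{(M)})$ gives the estimate.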

Figure~\ref{fig:syn_reg} shows that the epistemic uncertainty is inversely correlated with the training data density. The neural networks do not overfit the training data in the region $[0, 6]$, and training samples receive different epistemic uncertainties depending on their density.
\paragraph{Two-moon Dataset.}
We also apply our method to the two-moon dataset. We generate 500 two-dimensional training points using the scikit-learn package with noise 0.1. Each ensemble component is a ReLU network with two fully connected layers of 20 hidden nodes each; the network outputs logits for this two-class classification problem. As in the regression experiments, we construct a 2-component PE model in which the posterior distribution of the parameters $p(\theta|\mathcal{D},\beta)$ is approximated through LA by the equally weighted mixture $\sum_{i=1}^2 0.5\mathcal{N}(\theta;\theta_i,\Sigma_i)$.

The results are shown in Figure~\ref{fig:two_moon}. Again, the estimated epistemic uncertainty is inversely related to the training data density. As shown by DUQ \citep{van2020uncertainty}, the deep ensemble method performs poorly on the two-moon dataset. In contrast, our method performs better than the deep ensemble method.

\begin{figure}[ht]
\centering
     \includegraphics[width=0.35\linewidth]{two_moon.pdf}
     \caption{Synthetic data experiment on the two-moon dataset. The red points are the training data. Darker regions indicate higher epistemic uncertainty.}
     \label{fig:two_moon}
\end{figure}

\clearpage
\bibliography{main}



\end{document}
