%% for initial submission
% \documentclass{uai2022}

%% after acceptance, for a revised version; also before submission to see what the non-anonymous paper would look like
\documentclass[accepted]{uai2022}

%% There is a class option to choose the math font
% ptmx math instead of Computer Modern (has noticeable issues)
% \documentclass[accepted, mathfont=ptmx]{uai2022}

% newtx fonts (improves upon ptmx; less tested, no support)
% \documentclass[accepted, mathfont=newtx]{uai2022}
                                         
% NOTE: Only keep *one* line above as appropriate, as it will be replaced
%       automatically for papers to be published. Do not make any other
%       change above this note for an accepted version.

%% Choose your variant of English; be consistent
\usepackage[american]{babel}
% \usepackage[british]{babel}

%% Some suggested packages, as needed:
\usepackage{natbib} % has a nice set of citation styles and commands
    \bibliographystyle{plainnat}
    \renewcommand{\bibsection}{\subsubsection*{References}}

\usepackage{bm}
\usepackage{amsfonts}
\usepackage{mathtools} % amsmath with fixes and additions
% \usepackage{siunitx} % for proper typesetting of numbers and units
\usepackage{booktabs} % commands to create good-looking tables
\usepackage{tikz} % nice language for creating drawings and diagrams
\usepackage{multirow}
\usepackage{adjustbox} % provides the adjustbox environment used in the results table
\usepackage{caption}
\usepackage{subcaption}


% Define new commands
\DeclarePairedDelimiterX{\infdivx}[2]{(}{)}{#1\;\delimsize\|\;#2}
\newcommand{\YKL}{{\tt KL}\infdivx}
\newcommand\boldred[1]{\textcolor{red}{\textbf{#1}}}
\newcommand\boldblue[1]{\textcolor{blue}{\textbf{#1}}}
\newcommand\boldgreen[1]{\textcolor{green}{\textbf{#1}}}
\newcommand{\tpm}{\tiny{$\pm$}}
\DeclareMathOperator*{\argmax}{arg\,max}
\DeclareMathOperator*{\argmin}{arg\,min}

%% Provided macros
% \smaller: Because the class footnote size is essentially LaTeX's \small,
%           redefining \footnotesize, we provide the original \footnotesize
%           using this macro.
%           (Use only sparingly, e.g., in drawings, as it is quite small.)

%% Self-defined macros
\newcommand{\swap}[3][-]{#3#1#2} % just an example

\title{Self-Distribution Distillation: Efficient Uncertainty Estimation}

% The standard author block has changed for UAI 2022 to provide
% more space for long author lists and allow for complex affiliations
%
% All author information is automatically removed by the class for the
% anonymous submission version of your paper, so you can already add your
% information below.
%
% Add authors
\author[1]{\href{mailto:<yf286@eng.cam.ac.uk>}{Yassir~Fathullah}{}}
\author[1]{\href{mailto:<mjfg@eng.cam.ac.uk>}{Mark~J.~F.~Gales}{}}

% Add affiliations after the authors
\affil[1]{%
    Engineering Department\\
    University of Cambridge\\
    UK
}
  
\begin{document}
\maketitle

\begin{abstract}
	Deep learning is increasingly being applied in safety-critical domains. For these scenarios it is important to know the level of uncertainty in a model's prediction to ensure that the system makes appropriate decisions. Deep ensembles are the de facto standard approach to obtaining various measures of uncertainty. However, ensembles often significantly increase the resources required in the training and/or deployment phases. Existing approaches typically address the costs in only one of these phases.
	%
	In this work we propose a novel training approach, self-distribution distillation (S2D), which is able to efficiently train a single model that can estimate uncertainties. Furthermore, it is possible to build ensembles of these models and apply hierarchical ensemble distillation approaches.
	%
	Experiments on CIFAR-100 showed that S2D models outperformed standard models and Monte-Carlo dropout. Additional out-of-distribution detection experiments on LSUN, Tiny ImageNet and SVHN showed that even a standard deep ensemble can be outperformed using S2D-based ensembles and novel distilled models.
\end{abstract}

\section{Introduction}
\label{sec:intro}

% 1. Introduce the importance of uncertainty estimation and where it can be used.
Neural networks (NNs) have enjoyed much success in recent years, achieving state-of-the-art performance on a large number of tasks within domains such as natural language processing \citep{transformers}, speech recognition \citep{speechrec} and computer vision \citep{cvision}. Unfortunately, despite their predictive performance, NNs are known to yield poor estimates of the uncertainty in their predictions, that is, in \textit{knowing what they do not know} \citep{random-seed-ensemble, calbirationmnn}. With the increasing application of NN-based systems to safety-critical tasks such as biometric identification \citep{biometric}, medical diagnosis \citep{medical} or fully autonomous driving \citep{kendalldriving}, it becomes increasingly important to be able to robustly estimate the uncertainty in a model's prediction. With access to accurate measures of predictive uncertainty, a system can act in a safer and more informed manner.

Ensemble methods, and related schemes, have become the standard approach for uncertainty estimation. \citet{random-seed-ensemble} proposed generating a deep (random-seed) ensemble by training each member model with a different initialisation and stochastic gradient descent (SGD). Not only does this ensemble perform significantly better than a single NN trained in the standard way, it also displays better predictive uncertainty estimates. Although simple to implement, training and deploying an ensemble results in a linear increase in computational cost with the number of members. Alternatively, \citet{mc-dropout} introduced the \textit{Monte Carlo (dropout) ensemble} (MC ensemble), which at test time estimates predictive uncertainty by sampling members of an ensemble using dropout. Though this approach generally does not perform as well as a deep ensemble (given the same computational power and neglecting memory) \citep{random-seed-ensemble}, it is significantly cheaper to train as it integrates the ensemble generation method into training.

Despite ensemble generation methods being computationally more expensive, they have an important ability to decompose predictive (total) uncertainty into \textit{data} and \textit{knowledge uncertainty} \citep{decomposition, mc-dropout}. 
%
Knowledge or \textit{epistemic} uncertainty refers to the lack of knowledge, or ignorance, about the optimal choice of model (parameters) \citep{epistemic}.
%
As additional data is collected, the uncertainty in model parameters should decrease. This form of uncertainty becomes important whenever the model is tasked with making predictions for out-of-distribution data-points. For in-distribution inputs, it is expected that the trained model can return reliable predictions. On the other hand, data or \textit{aleatoric} uncertainty represents inherent noise in the data being modelled, for example from overlapping classes. Even if more data is collected, this type of noise is inherent to the process and cannot be avoided or reduced \citep{prior-networks, mc-dropout, trust-uncertainty}. The ability to decompose and distinguish between these sources of uncertainty is important as it allows the cause of uncertainty in the prediction to be known. This in turn informs the user how the prediction should be used in downstream tasks \citep{bald, batchbald}.

% 4. Summary of contributions.
\textit{Summary of contributions}: In this work we make two important contributions to NN classifier training and uncertainty prediction. First, we introduce \textit{self-distribution distillation} (S2D), a new general training approach that, in an integrated and simultaneous fashion, trains a teacher ensemble and distribution distils its knowledge to a student. This integrated training allows the user to bypass training a separate, expensive teacher ensemble, while distribution distillation \citep{en2d} allows the student to capture the diversity of, and model a distribution over, ensemble member predictions. Additionally, distribution distillation gives the student the ability to estimate both data and knowledge uncertainty in a single forward pass, unlike standard NNs, which inherently cannot decompose predictive uncertainty, and unlike ensemble methods, which cannot perform the decomposition in a single pass.
% 
Second, we train an ensemble of these newly introduced models and investigate different distribution distillation techniques, giving rise to \textit{hierarchical distributions over predictions for uncertainty}. This approach is useful when there are few or no computational constraints in the training phase, but robust uncertainties and efficiency are still required in the deployment stage.


\section{Background}
\label{sec:background}

This section describes two techniques for uncertainty estimation. First, ensemble methods for predictive uncertainty estimation are presented from a Bayesian viewpoint. Second, a specific form of distillation for efficient uncertainty estimation is discussed.

\subsection{Ensemble Methods}
\label{ssec:ensemble}

From a Bayesian perspective the parameters, $\bm\theta$, of a neural network are treated as random variables with some prior distribution ${\tt p}(\bm\theta)$. Together with the training data $\mathcal{D}$, this allows the posterior distribution ${\tt p}(\bm\theta\vert\mathcal{D})$ to be derived. To obtain the predictive distribution over all classes $y \in \mathcal{Y}$ (for some input $\bm x^*$), marginalisation over $\bm\theta$ is required:
\begin{align}
	{\tt P}(y \vert \bm x^*, \mathcal{D}) = \mathbb{E}_{ {\tt p}(\bm\theta\vert\mathcal{D}) } \Big[ {\tt P}(y \vert \bm x^*, \bm\theta) \Big] \nonumber
\end{align}
Since finding the true posterior is intractable, a variational approximation ${\tt p}(\bm\theta\vert\mathcal{D}) \approx {\tt q}(\bm\theta)$ is made \citep{jvi, approx1, approx2, swag}. Furthermore, marginalising over all weight values remains intractable, leading to a sampling-based (ensemble) approximation \citep{mc-dropout, random-seed-ensemble}:
\begin{align}
	{\tt P}(y \vert \bm x^*, \mathcal{D}) \approx \frac{1}{M} \sum_{m=1}^{M} {\tt P}(y \vert \bm x^*, \bm\theta^{(m)}), \medspace \bm\theta^{(m)} \sim {\tt q}(\bm\theta) \nonumber
\end{align}
Here, an ensemble generation method is required to obtain the predictive distribution and uncertainty. Two previously mentioned approaches to generating an ensemble are deep (naive) random-seed ensembles and MC-dropout ensembles\footnote{In-depth comparisons of ensemble methods were conducted in \citet{trust-uncertainty, pitfalls}.}.
%
Deep ensembles are based on training $M$ models on the same data but with different initialisations, leading to functionally different solutions. On the other hand, an MC-dropout ensemble explicitly defines a variational approximation through the hyper-parameters of dropout \citep{dropout} (used during training), allowing for straightforward sampling of model parameters. Another related technique, Model Soups \citep{soups}, is based on fine-tuning an already trained model using different hyper-parameters \citep{hyperdeepens} to achieve some diversity, which saves compute since fine-tuning is often cheaper than training a full model. Furthermore, SWA-Gaussian \citep{swag} finds a Gaussian approximation based on the first two moments of the stochastic gradient descent iterates. Unlike the deep ensemble approach, and similar to MC-dropout, this method allows for simple and efficient sampling but suffers from higher memory consumption: even a diagonal Gaussian approximation requires twice the memory of a standard network.

There also exist alternative memory- and/or compute-efficient ensemble approaches such as BatchEnsembles \citep{batchens} and MIMO \citep{mimo}. While the former approach is parameter efficient, it requires multiple forward passes at test time, similar to MC ensembles. The latter avoids this issue by generating independent subnetworks within a single deep model through the simultaneous ``mixing'' of multiple inputs and generation of multiple outputs. Although the training cost of such a system could be comparable to that of a deep ensemble \citep{mimo}, the inference cost is significantly lower. However, MIMO suffers from several drawbacks, one being the requirement of several input and output layers, which in large-scale classification could consist of many millions of parameters. Finally, while many of the mentioned ensemble methods can straightforwardly be generalised to sequence tasks such as neural machine translation, MIMO presents a further challenge: it becomes a non-trivial task to ``mix'' input sequences of different lengths and to address how this should be handled by sequence models such as transformers.


\subsubsection{Predictive Uncertainty Estimation}
\label{sssec:unc}

Given an ensemble, the goal is to estimate and decompose the predictive uncertainty. First, the entropy of the predictive distribution ${\tt P}(y \vert \bm x^*, \mathcal{D})$ can be seen as a measure of total uncertainty. Second, this can be decomposed \citep{decomposition, whatunc} as:
\begin{align}\label{eq: ensemble uncertainty}
\begin{split}
	\overbrace{\mathcal{H}\big[{\tt P}(y \vert \bm x^*, \mathcal{D})\big]}^{\text{\textit{Total Uncertainty}}} \hspace{2mm}  = \hspace{1mm} 
	& \overbrace{\mathcal{I}\big[y, \bm\theta \vert \bm x^*, \mathcal{D}\big]}^{\text{\textit{Knowledge Uncertainty}}} \hspace{1mm} \\ 
	& \hspace{1.2mm} + \hspace{2mm} 
	\underbrace{\mathbb{E}_{{\tt p}(\bm\theta\vert\mathcal{D})}\big[ \mathcal{H}[{\tt P}(y \vert \bm x^*, \bm\theta)] \big]}_{\text{\textit{Data Uncertainty}}}
\end{split}
\end{align}
where $\mathcal{I}$ is mutual information and $\mathcal{H}$ represents entropy. This specific decomposition allows total uncertainty to be decomposed into separate estimates of knowledge and data uncertainty. Furthermore, the conditional mutual information can be rephrased as:
\begin{align}
	\mathcal{I}\big[y, \bm\theta \vert \bm x^*, \mathcal{D}\big] = \mathbb{E}_{{\tt p}(\bm\theta\vert\mathcal{D})}\Big[ 
	\YKL[\big]{ {\tt P}(y \vert \bm x^*, \bm\theta) }{ {\tt P}(y \vert \bm x^*, \mathcal{D}) }
	\Big] \nonumber
\end{align}
For an in-domain sample $\bm x^*$ the mutual information should be low, as the predictions ${\tt P}(y \vert \bm x^*, \bm\theta)$ of appropriately trained models should be close to the predictive distribution. In this case, high predictive uncertainty will only occur if the input lies in a region of high data uncertainty, for example when an input has significant class overlap. When the input $\bm x^*$ is out-of-distribution with respect to the training data, one should expect inconsistent predictions ${\tt P}(y \vert \bm x^*, \bm\theta)$ across models, leading to a much higher knowledge uncertainty estimate.
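
As a small illustrative example (hypothetical numbers, not taken from the experiments), consider a two-class problem and an ensemble of $M = 2$ members. If the members disagree, predicting $(0.9, 0.1)$ and $(0.1, 0.9)$, the averaged prediction is $(0.5, 0.5)$ and, using natural logarithms,
\begin{align}
	\text{Total} & = \ln 2 \approx 0.693, \nonumber \\
	\text{Data} & = -(0.9 \ln 0.9 + 0.1 \ln 0.1) \approx 0.325, \nonumber \\
	\text{Knowledge} & \approx 0.693 - 0.325 = 0.368. \nonumber
\end{align}
If instead both members predict $(0.5, 0.5)$, the total uncertainty is again $\ln 2$, but the data uncertainty is also $\ln 2$ and the knowledge uncertainty is zero. The two cases therefore have the same total uncertainty but different sources, which is exactly what the decomposition is meant to expose.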


\subsection{Ensemble Distillation Methods}
\label{ssec:end}

Ensemble methods have generally shown superior performance on a range of tasks but suffer from being computationally expensive. To tackle this issue, a technique called \textit{knowledge distillation} (KD) and its variants were developed for transferring the knowledge of an ensemble (teacher) into a single (student) model while maintaining good performance \citep{kd, sequence-kd, onlinekd, vadera}. This is generally achieved by minimising the KL-divergence between the student prediction and the predictive distribution of the teacher ensemble. In essence, KD trains a new student model to predict the average prediction of its teacher ensemble. However, from the perspective of uncertainty estimation, the student model no longer has any information about the diversity of the ensemble member predictions; it was only trained to model the average prediction. Hence, it is no longer possible to decompose the total uncertainty into different sources; only the total uncertainty can be obtained from the student. To tackle this issue, \textit{ensemble distribution distillation} (En2D) was developed \citep{en2d}.


Let $\bm\pi$ signify a categorical distribution, that is $\pi_c = {\tt P}(y = \omega_c \vert \bm\pi)$. The goal is to directly model the space of categorical predictions $\{\bm\pi^{(m)} = \bm f(\bm x^*; \bm\theta^{(m)})\}_{m = 1}^{M}$ made by the ensemble. In work developed by \citet{en2d} this is done by letting a student model (with weights $\bm\phi$) predict the parameters of a Dirichlet, which is a continuous distribution over categorical distributions:
\begin{align}\label{eq: dirichlet}
	{\tt p}(\bm\pi \vert \bm x^*, \bm\phi) = {\tt Dir}(\bm\pi; \bm\alpha), \medspace \bm\alpha = \bm f(\bm x^*; \bm\phi)
\end{align}
The key idea in this concept is that we are not directly interested in the posterior ${\tt p}(\bm\theta\vert\mathcal{D})$ but in how the predictions $\bm\pi$ for particular inputs behave when induced by this posterior. Therefore, it is possible to replace ${\tt p}(\bm\theta\vert\mathcal{D})$ with a trained distribution ${\tt p}(\bm\pi \vert \bm x^*, \bm\phi)$. The student must then be trained on the information provided by the teacher, which is straightforwardly done using the negative log-likelihood:
\begin{align}\label{eq: mlll}
\mathcal{L}(\bm\phi) = -\frac{1}{M} \sum_{m = 1}^{M} \ln{\tt Dir}(\bm\pi^{(m)}; \bm\alpha)
\end{align}
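
For reference, the standard Dirichlet log-density appearing in this loss can be written out as follows, where $K$ denotes the number of classes, $\alpha_0 = \sum_{c=1}^{K} \alpha_c$ and $\Gamma$ is the gamma function:
\begin{align}
	\ln{\tt Dir}(\bm\pi^{(m)}; \bm\alpha) = \ln\Gamma(\alpha_0) - \sum_{c=1}^{K} \ln\Gamma(\alpha_c) + \sum_{c=1}^{K} (\alpha_c - 1) \ln \pi^{(m)}_c \nonumber
\end{align}
One observation from this form (offered as an informal reading rather than a result of this work) is that the last term diverges as any $\pi^{(m)}_c \rightarrow 0$, so very sharp teacher predictions can dominate the loss.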
A decomposable estimate of total uncertainty is then possible by using conditional mutual information between the class $y$ and prediction $\bm\pi$ \citep{prior-networks}:
\begin{align}\label{eq: dir uncertainty}
\begin{split}
	\overbrace{\mathcal{H}\big[{\tt P}(y \vert \bm x^*, \bm\phi)\big]}^{\text{\textit{Total Uncertainty}}} \hspace{2mm} = \hspace{1mm} 
	& \overbrace{\mathcal{I}\big[y, \bm\pi \vert \bm x^*, \bm\phi\big]}^{\text{\textit{Knowledge Uncertainty}}} \\
	& \hspace{1.2mm}  + \hspace{2mm} 
	\underbrace{\mathbb{E}_{{\tt p}(\bm\pi\vert\bm x^*, \bm\phi)}\big[ \mathcal{H}[{\tt P}(y \vert \bm \pi)] \big]}_{\text{\textit{Data Uncertainty}}}
\end{split}
\end{align}
This decomposition has a similar interpretation to eq. (\ref{eq: ensemble uncertainty}). Using a Dirichlet model, these uncertainties can be found using a single forward pass, achieving a much higher level of efficiency compared to an ensemble. Assuming this distillation technique is successful, the distribution distilled student should be able to closely emulate the ensemble and be able to estimate similar high quality uncertainties on both ID and OOD data.
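
As a sketch of why a single forward pass suffices, the terms above have well-known closed forms for a Dirichlet. Writing $\alpha_0 = \sum_{c} \alpha_c$ and letting $\psi$ denote the digamma function (notation introduced here for illustration), the predictive distribution is ${\tt P}(y = \omega_c \vert \bm x^*, \bm\phi) = \alpha_c / \alpha_0$ and
\begin{align}
	\mathbb{E}_{{\tt p}(\bm\pi\vert\bm x^*, \bm\phi)}\big[ \mathcal{H}[{\tt P}(y \vert \bm \pi)] \big] & = -\sum_{c} \frac{\alpha_c}{\alpha_0} \Big( \psi(\alpha_c + 1) - \psi(\alpha_0 + 1) \Big), \nonumber \\
	\mathcal{I}\big[y, \bm\pi \vert \bm x^*, \bm\phi\big] & = -\sum_{c} \frac{\alpha_c}{\alpha_0} \Big( \ln\frac{\alpha_c}{\alpha_0} - \psi(\alpha_c + 1) + \psi(\alpha_0 + 1) \Big) \nonumber
\end{align}
The second line is simply the entropy of $\bm\alpha / \alpha_0$ minus the first line, so all three quantities follow directly from the predicted $\bm\alpha$.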

However, ensemble distribution distillation is only applicable and useful when the ensemble members are not overconfident and display diversity in their predictions; there is no point in capturing diversity when there is none. For example, many state-of-the-art convolutional neural networks are over-parameterised, display severe overconfidence and can essentially achieve perfect training accuracy, which restricts the effectiveness of distribution distillation in capturing the diversity in the teacher ensemble \citep{calbirationmnn, single-shot, large-scale-unc}. Furthermore, this method can only be used when an ensemble is available, leading to a high training cost.


\section{Self-Distribution Distillation}
\label{sec:s2d}

\begin{figure*}[t!]
	\centering
	\includegraphics[width=0.90\textwidth]{figures/s2d.pdf}
	\caption{Dirichlet S2D model during training. Only the black part of the network is retained during inference, matching the behaviour of a standard model.}
	\label{fig:s2d}
\end{figure*}

In this section we propose \textit{self-distribution distillation} (S2D) for efficient training and uncertainty estimation, bypassing the need for a separate teacher ensemble. This combines:
\begin{itemize}
	\vspace{-2mm}
	\setlength\itemsep{-0.2mm}
	\item \textit{parameter sharing}: the teacher and student share a common feature extraction base, which accelerates training significantly; each then branches off into its own head;
	\item \textit{stochastic regularisation}: the teacher can generate multiple predictions efficiently by forward propagating an input through its head (with a stochastic regulariser) several times, emulating the behaviour of an ensemble;
	\item \textit{distribution distillation}: while the teacher branch is trained on cross-entropy, the student is taught to predict a distribution over teacher predictions capturing the diversity compactly.
\end{itemize}
\vspace{-2mm}

This process is summarised in Fig. \ref{fig:base}. 
\begin{figure}[h!]
	\centering
	\includegraphics[width=1.00\linewidth]{figures/base.pdf}
	\vspace{-1mm}
	\caption{General structure of a self-distribution distilled model. $M$ stochastic teacher branch forward propagations are trained on cross-entropy and simultaneously distribution distilled to the student.}
	\label{fig:base}
\end{figure}
The proposed approach can take many specific forms with regard to the type of feature extraction module, stochastic regulariser, teacher branch and student modelling choice. For example, the teacher could be a much larger branch capturing complex patterns in the data, while the student could be a smaller branch that compresses the teacher's knowledge into a form that is more efficient at test time. At the other extreme, training efficiency can be achieved by forcing the teacher and student to share the same branch parameters.

In this work, we choose a highly efficient model configuration, shown in Fig. \ref{fig:s2d}. The main functional difference between the teacher and the student branches is the use of the logit values, ${\bm z}$: the teacher branch maps them to a categorical prediction, whereas the student uses them to parameterise a Dirichlet distribution. Furthermore, the teacher uses stochastic regularisation techniques (SRTs) to generate multiple teacher predictions, analogous to an ensemble. In this work, multiplicative Gaussian noise (Gaussian dropout) with unit mean and uniformly random standard deviation is used. This form was chosen for its simplicity of sampling and because ensemble diversity can be controlled simply through the level of variance in the noise. There is a wide range of other choices regarding which SRTs to use, from Bernoulli dropout to additive Gaussian noise, and at which teacher branch layers they should be introduced. Furthermore, since the Dirichlet distribution has bounded ability to represent diverse ensemble predictions \citep{en2d}, simply generating multiple teacher predictions by propagating through the last layer will not be the limiting factor in this model. To further improve the memory efficiency of the model, a single final linear layer shared by both student and teacher branches is used. This parameter sharing makes the S2D model efficient even when the number of classes is large, and it does not use any more parameters than a standard model. Note that any NN classifier can be cast into a self-distribution distillation format by inserting stochasticity prior to the final linear layer, and that S2D can easily be combined with many other approaches such as MIMO \citep{mimo} and SWAG \citep{swag}.
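
A minimal sketch of the teacher sampling described above is given below. The symbols $\bm h$ (features entering the final layer), $\bm W$, $\bm b$ (shared final linear layer), $s$ (noise standard deviation) and $s_{\max}$ (its upper bound) are notation assumed here for illustration and are not defined elsewhere in this work:
\begin{align}
	s \sim \mathcal{U}(0, s_{\max}), \medspace\medspace
	\bm\epsilon^{(m)} & \sim \mathcal{N}(\bm 1, s^2 \bm I), \nonumber \\
	\bm z^{(m)} & = \bm W \big( \bm h \odot \bm\epsilon^{(m)} \big) + \bm b, \medspace\medspace m = 1, \dots, M \nonumber
\end{align}
Under these assumptions the noise has unit mean, so averaging over the noise gives $\mathbb{E}\big[\bm z^{(m)}\big] = \bm W \bm h + \bm b$, the logits of the noise-free forward pass, and only the final linear layer needs to be evaluated $M$ times.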

This choice of integrating ensemble teacher training and distribution distillation into a single entity with parameter tying also serves as a regulariser (two objectives are optimised using the same set of weights) and allows for inexpensive training. A further regularisation effect arises from training the student on the forward KL-divergence (a mode-covering loss): the student, and therefore the teacher, will have smoother predictions. The only added training cost comes from multiple forward passes through the final linear layer, a process which can easily be parallelised. Additionally, the restricted form of Fig. \ref{fig:s2d} brings some numerical stability. As noted by \citet{en2d}, optimising a student to predict a Dirichlet distribution can be unstable when there is a lack of common support between the prediction and extremely sharp teacher outputs. However, note that the teacher predictions are closely related to the expected student prediction:
\begin{align}
	\bm\pi^{(m)} & = {\tt Softmax}(\bm z^{(m)}), 
	\medspace\medspace
	\bm z = \mathbb{E}\big[ \bm z^{(m)} \big], \nonumber \\
	\mathbb{E}_{{\tt Dir}(\bm\pi; \bm\alpha)} \big[ \bm\pi \big] & = 
	\frac{\bm\alpha}{\alpha_0} = 
	{\tt Softmax}(\bm z) \nonumber
\end{align}
leading to increased common support. Additionally, multiplicative stochasticity in the teacher forces the outputs to have some diversity, mildly limiting overconfidence.
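
One parameterisation consistent with the identity above is an element-wise exponential, $\bm\alpha = \exp(\bm z)$. This is stated here as an assumption for illustration, since the mapping from logits to Dirichlet parameters is not spelled out in this section. Under that assumption, for each class $c$,
\begin{align}
	\frac{\alpha_c}{\alpha_0} = \frac{\exp(z_c)}{\sum_{k} \exp(z_k)} = {\tt Softmax}(\bm z)_c \nonumber
\end{align}
so the mean of the student Dirichlet coincides with the categorical prediction obtained from the noise-free logits.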

\subsection{Training Criteria and Temperature}
Now we train the teacher branch using cross-entropy, and simultaneously, use the teacher predictions to train the student branch. Let the weights of this model be denoted by $\bm\phi$ and say we have some input-target pair $(\bm x, y)$. The teacher
loss (for a single sample) is then:
\vspace{-3mm}
\begin{align}
	\mathcal{L}_{{\tt th}}(\bm\phi) = -\frac{1}{M} \sum_{m = 1}^{M} \sum_{c} \delta(y, \omega_c) \ln \pi^{(m)}_c \nonumber
\end{align}
where $\delta$ is the indicator function. The student branch could be trained using log-likelihood as in eq. (\ref{eq: mlll}) but it has been found that this approach could be unstable \citep{gec-distillation, large-scale-unc}. Instead we use the teacher categorical predictions in estimating a proxy teacher Dirichlet $\tilde{\bm\alpha}$ using maximum log-likelihood. The resulting student loss is KL-divergence based:
\begin{align}
	\mathcal{L}_{{\tt st}}(\bm\phi) & = \YKL[\Big]{{\tt Dir}(\bm\pi; \tilde{\bm\alpha})}{{\tt Dir}(\bm\pi; \bm\alpha)}, \nonumber\\
	\tilde{\bm\alpha} & = \argmax_{\hat{\bm\alpha}} \sum_{m} \ln{\tt Dir}(\bm\pi^{(m)}; \hat{\bm\alpha}) \nonumber
\end{align} 
The proxy Dirichlet is estimated using a numerical approach developed by \citet{minka}. The overall training loss becomes $\mathcal{L}(\bm\phi) = \mathcal{L}_{{\tt th}}(\bm\phi) + \mu \mathcal{L}_{{\tt st}}(\bm\phi)$ with a small constant $\mu$.
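
For reference, the KL-divergence between two Dirichlets used in the student loss has a standard closed form. With $\tilde\alpha_0 = \sum_c \tilde\alpha_c$, $\alpha_0 = \sum_c \alpha_c$ and $\psi$ the digamma function (notation introduced here for illustration):
\begin{align}
	\YKL[\Big]{{\tt Dir}(\bm\pi; \tilde{\bm\alpha})}{{\tt Dir}(\bm\pi; \bm\alpha)} = & \hspace{1mm} \ln\Gamma(\tilde\alpha_0) - \ln\Gamma(\alpha_0) + \sum_{c} \Big( \ln\Gamma(\alpha_c) - \ln\Gamma(\tilde\alpha_c) \Big) \nonumber \\
	& + \sum_{c} (\tilde\alpha_c - \alpha_c) \Big( \psi(\tilde\alpha_c) - \psi(\tilde\alpha_0) \Big) \nonumber
\end{align}
Since this expression depends on the teacher only through the proxy parameters $\tilde{\bm\alpha}$, the student loss never evaluates $\ln \pi^{(m)}_c$ directly once $\tilde{\bm\alpha}$ has been estimated.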

Deep learning models often overfit on training data leading to less informative outputs. To alleviate these issues we integrate temperature scaling in the student branch loss. While training the teacher branch predictions on cross-entropy we temperature scale the same predictions and use the resulting ones in estimating a proxy teacher Dirichlet. The student branch will repeatedly be taught to predict a smoother/wider Dirichlet distribution, while the teacher branch's objective is to maximise the probability of the correct class resulting in a middle ground. 
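
A sketch of one common form of temperature scaling is shown below, assuming the temperature divides the teacher logits before the softmax; the symbols $T$ and $\bm\pi^{(m)}_{T}$ are introduced here for illustration only:
\begin{align}
	\bm\pi^{(m)}_{T} = {\tt Softmax}\big( \bm z^{(m)} / T \big), \medspace\medspace T \geq 1 \nonumber
\end{align}
As a toy two-class example with $\bm z^{(m)} = (2, 0)$, setting $T = 1$ gives approximately $(0.881, 0.119)$, whereas $T = 2$ gives approximately $(0.731, 0.269)$. The smoothed predictions $\bm\pi^{(m)}_{T}$ would then be used in place of $\bm\pi^{(m)}$ when estimating the proxy Dirichlet, while the teacher cross-entropy continues to use the unscaled predictions.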


\begin{table*}
	\centering{}
	\begin{minipage}[t]{1.0\textwidth}%
		\begin{center}
			\caption{Test performance ($\pm$ 2 std) and compute cost. Dropout regularisation was only used for C100. Inference times (per input) were estimated using an NVIDIA V100 GPU. *SWAG inference speeds do not take into account the time to update batch norm statistics.} 
			\vspace{-2mm}
			\bgroup
			\def\arraystretch{1.00}
			\begin{adjustbox}{center}
				% \small
				\begin{tabular}{l|lll|lll|cc}
					\toprule
					\multirow{1}{*}{{ Dataset}} & 
					\multicolumn{3}{c|}{\multirow{1}{*}{{ C100}}} & 
					\multicolumn{3}{c|}{\multirow{1}{*}{{ C100+}}} & 
					\multicolumn{2}{c}{\multirow{1}{*}{{ Computational Cost}}}\\
					%
					\multicolumn{1}{l|}{\multirow{1}{*}{{ Model}}} & 
					\multicolumn{1}{c}{\multirow{1}{*}{{ Acc}}} & 
					\multicolumn{1}{c}{\multirow{1}{*}{{ NLL}}} &
					\multicolumn{1}{c|}{\multirow{1}{*}{{ \%ECE}}} &
					\multicolumn{1}{c}{\multirow{1}{*}{{ Acc}}} & 
					\multicolumn{1}{c}{\multirow{1}{*}{{ NLL}}} &
					\multicolumn{1}{c|}{\multirow{1}{*}{{ \%ECE}}} & 
					\multicolumn{1}{c}{\multirow{1}{*}{{ Params}}} & 
					% \multicolumn{1}{c}{\multirow{1}{*}{{ Train}}} & 
					\multicolumn{1}{c}{\multirow{1}{*}{{ Inference}}} \\
					\midrule
					{Individual} & 
					74.6  \tpm  0.5 &  1.11 \tpm 0.07 &  11.95 \tpm 1.65 & 
					77.5 \tpm 0.2 &  1.01 \tpm 0.14 &  10.84 \tpm 2.32 &
					\multirow{2}{*}{0.80M}  &  \multirow{2}{*}{2.3ms} \\ %  4.3h
					%
					{S2D Individual} & 
					\textbf{75.7}  \tpm  0.5 &  \textbf{0.87}  \tpm  0.02 &  \textbf{2.54}  \tpm  1.11  & 
					\textbf{78.1} \tpm 0.4 &  \textbf{0.81} \tpm 0.03 &  \textbf{4.35} \tpm 1.23 & & \\
					\midrule
					%
					%
					%
					{MIMO} & 
					75.2 \tpm 0.6 & 1.05 \tpm 0.13 & 10.51 \tpm 2.75 & 
					77.6 \tpm 0.7 & 0.89 \tpm 0.18 & 8.23 \tpm 3.90 &
					\multirow{2}{*}{0.83M} &  \multirow{2}{*}{2.3ms} \\ %  4.3h &
					%
					{S2D MIMO} & 
					\textbf{75.4} \tpm 0.1 & \textbf{0.90} \tpm 0.08 & \textbf{5.77} \tpm 1.63 & 
					\textbf{78.1} \tpm 0.6 & \textbf{0.80} \tpm 0.07 & \textbf{4.07} \tpm 0.43 & 
					& \\ %  4.5h &
					%
					%
					%
					\midrule 
					{SWAG-Diag} & 
					74.8 \tpm 1.0 &  1.08 \tpm 0.05 &  10.73 \tpm 1.31 & 
					77.7 \tpm 0.9 &  0.98 \tpm 0.03 &  9.60 \tpm 3.25 &
					\multirow{2}{*}{1.60M} &  \multirow{2}{*}{11.6ms*} \\ %   4.3h &
					%
					{S2D SWAG-Diag} & 
					\textbf{75.9} \tpm 0.6 &  \textbf{0.85} \tpm 0.03 &  \textbf{3.87} \tpm 0.88 & 
					\textbf{78.2} \tpm 1.3 &  \textbf{0.79} \tpm 0.07 &  \textbf{3.65} \tpm 0.62 & 
					& \\ %  4.4h &
					\midrule
					%
					%
					%
					{MC ensemble} & 
					75.6 \tpm 0.9 &  0.94 \tpm 0.04 &  6.67 \tpm 1.18 &
					- & - & - & 
					\multirow{2}{*}{0.80M} &  \multirow{2}{*}{11.5ms} \\ % &  4.3h 
					%
					{S2D MC ensemble} & 
					\textbf{76.6} \tpm 0.4 &  \textbf{0.83} \tpm 0.02 &  \textbf{2.57} \tpm 0.58  &
					- & - & - & 
					& \\ %  4.4h &
					\midrule
					%
					%
					%
					{Deep ensemble} & 
					79.3 &  0.76 &  \textbf{1.44} & 
					\textbf{82.1} &  0.66 &  \textbf{1.61} & 
					\multirow{2}{*}{4.00M} &   \multirow{2}{*}{11.5ms} \\ %  21.5h &
					%
					{S2D Deep ensemble} & 
					\textbf{79.7} &  \textbf{0.73} &  5.48 & 
					\textbf{82.1} &  \textbf{0.64} &  3.79 & 
					& \\ %  22.0h &
					\midrule
					%
					%
					%
					{EnD} & 
					\textbf{77.9} &  0.91 &  10.36 & 
					\textbf{81.2} &  0.81 &  9.51 & 
					\multirow{2}{*}{ 0.80M} & \multirow{2}{*}{2.3ms} \\ %  24.4h & 
					{H2D-Dir} &
					77.7 &  0.84 &  3.24 & 
					80.9 &  0.71 &  3.42 & 
					&  \\ %  24.0h &
					{H2D-Gauss} & 
					77.5 &  \textbf{0.77} &  \textbf{1.39} & 
					80.5 &  \textbf{0.68} &  \textbf{2.41} & 
					0.83M &  2.4ms \\ %  24.3h &
					\bottomrule
				\end{tabular}
			\end{adjustbox}
			\egroup
			\label{tab:all-models-v1}
			\par\end{center}
	\end{minipage}
\end{table*}


\section{Self-Distribution Distilled Ensemble Approaches}

If computational resources during the training phase are not constrained, this opens up the possibility of self-distribution distilled ensembles and various hierarchical distillation approaches for such models. First, note that the ensemble generation methods mentioned in Section \ref{ssec:ensemble} can easily be used with the S2D models of Section \ref{sec:s2d}. The predictive distribution of such an ensemble takes the following form:
\begin{align}
	{\tt P}(y = \omega_c \vert \bm x^*, \mathcal{D}) & = \mathbb{E}_{{\tt p}(\bm\phi \vert \mathcal{D})} \left[ \mathbb{E}_{{\tt p}(\bm\pi \vert \bm x^*, \bm\phi)} \left[ {\tt P}(y = \omega_c \vert \bm\pi) \right] \right] \nonumber \\
	&  = \mathbb{E}_{{\tt p}(\bm\phi \vert \mathcal{D})} \left[ \frac{\alpha_c}{\alpha_0} \right] \approx \frac{1}{M} \sum_{m = 1}^{M} \frac{\alpha^{(m)}_c}{\alpha^{(m)}_0} \nonumber
\end{align}
Furthermore, an ensemble of Dirichlet models can be used to estimate similar uncertainty measures as previously described:
\begin{align}
	\mathcal{H}\big[{\tt P}(y \vert \bm x^*, \mathcal{D})\big] = 
	& \hspace{1mm} \mathcal{I}\big[y, \bm\pi \vert \bm x^*, \mathcal{D}\big] \nonumber \\ 
	& + \mathbb{E}_{{\tt p}(\bm\phi\vert\mathcal{D})}\Big[ \mathbb{E}_{{\tt p}(\bm\pi\vert\bm x^*, \bm\phi)}\big[ \mathcal{H}[{\tt P}(y \vert \bm \pi)] \big] \Big] \nonumber
\end{align}
This is a generalisation of eq. (\ref{eq: dir uncertainty}) since specific weights $\bm\phi$ have been replaced with conditioning on the dataset $\mathcal{D}$. Computing these uncertainties requires only a few modifications compared to the standard ensemble in eq. (\ref{eq: ensemble uncertainty}).
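
As a sketch of how these quantities could be estimated from $M$ Dirichlet members with parameters $\bm\alpha^{(m)}$, let $\bar\pi_c = \frac{1}{M} \sum_{m} \alpha^{(m)}_c / \alpha^{(m)}_0$ denote the ensemble-averaged prediction and $\psi$ the digamma function (both symbols introduced here for illustration). Using the standard closed form for the expected entropy under a Dirichlet:
\begin{align}
	\mathcal{H}\big[{\tt P}(y \vert \bm x^*, \mathcal{D})\big] & \approx -\sum_{c} \bar\pi_c \ln \bar\pi_c, \nonumber \\
	\mathbb{E}_{{\tt p}(\bm\phi\vert\mathcal{D})}\Big[ \mathbb{E}_{{\tt p}(\bm\pi\vert\bm x^*, \bm\phi)}\big[ \mathcal{H}[{\tt P}(y \vert \bm \pi)] \big] \Big] & \approx -\frac{1}{M} \sum_{m = 1}^{M} \sum_{c} \frac{\alpha^{(m)}_c}{\alpha^{(m)}_0} \Big( \psi(\alpha^{(m)}_c + 1) - \psi(\alpha^{(m)}_0 + 1) \Big) \nonumber
\end{align}
The knowledge uncertainty $\mathcal{I}\big[y, \bm\pi \vert \bm x^*, \mathcal{D}\big]$ is then obtained as the difference between the first and second lines.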


\subsection{Hierarchical Distribution Distillation}

Next, the most natural step is to transfer the knowledge of an S2D (Dirichlet) ensemble into a single model. A choice needs to be made regarding the hierarchy of student modelling: the student could predict a categorical\footnote{Since transferring knowledge from a Dirichlet ensemble into a student predicting a categorical critically loses information about diversity, this method will not be investigated.}, a Dirichlet, or a distribution over Dirichlets. We give this family of approaches the name \textit{hierarchical distribution distillation} (H2D). We start by training a student model to predict a single Dirichlet, as in eq. (\ref{eq: dirichlet}). However, since the S2D ensemble provides, for an input $\bm x^*$, a set of Dirichlets $\{\bm\alpha^{(m)} = \bm f(\bm x^*; \bm\phi^{(m)})\}_{m = 1}^{M}$, a modified distillation criterion is needed:
\begin{align}
	\mathcal{L}(\bm\phi) = \frac{1}{M}\sum_{m = 1}^{M} \YKL[\Big]{{\tt Dir}(\bm\pi; \bm\alpha^{(m)})}{{\tt Dir}(\bm\pi; \bm\alpha)} \nonumber
\end{align}
where $\bm\alpha = \bm f(\bm x^*; \bm\phi)$. This KL-divergence based loss also allows the reverse KL criterion to be used \citep{revKL} if desired. One criticism of this form of model,  Dirichlet H2D (H2D-Dir), is that the diversity across ensemble members is lost, similar to the drawback in standard distillation. Therefore, we seek a distribution over Dirichlets to capture this higher level of diversity.
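
For reference, each term of the H2D-Dir criterion above has a well-known closed form. The expression below is a sketch under the assumption that $\alpha_0 = \sum_{c} \alpha_c$ and $\alpha^{(m)}_0 = \sum_{c} \alpha^{(m)}_c$, with $\Gamma$ the gamma function and $\psi$ the digamma function (both symbols introduced here for illustration):
\begin{align}
	\YKL[\Big]{{\tt Dir}(\bm\pi; \bm\alpha^{(m)})}{{\tt Dir}(\bm\pi; \bm\alpha)} 
	= & \hspace{1mm} \ln\Gamma\big(\alpha^{(m)}_0\big) - \ln\Gamma\big(\alpha_0\big) - \sum_{c = 1}^{K} \Big( \ln\Gamma\big(\alpha^{(m)}_c\big) - \ln\Gamma\big(\alpha_c\big) \Big) \nonumber \\
	& + \sum_{c = 1}^{K} \big(\alpha^{(m)}_c - \alpha_c\big) \Big( \psi\big(\alpha^{(m)}_c\big) - \psi\big(\alpha^{(m)}_0\big) \Big) \nonumber
\end{align}
Since only $\bm\alpha$ depends on the student weights $\bm\phi$, the terms involving $\bm\alpha^{(m)}$ alone act as constants during training.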

To model the space of Dirichlets we need to define a distribution over the parameters. Here we are faced with a choice: (1) model the parameters $\bm\alpha \in \mathbb{R}_+^{K}$ directly (restricted to the positive real space) or (2) apply a transformation to simplify the modelling. Here a logarithmic transformation $\bm z = \ln \bm\alpha \in \mathbb{R}^{K}$ is applied and a simple distribution over the transformed Dirichlet parameters, a diagonal Gaussian, is used (see Appendix \ref{sec:behaviour} for a justification of this modelling choice). With these building blocks, the goal of H2D is to train a student model with weights $\bm\lambda$ to predict the parameters of a diagonal Gaussian $(\bm\mu, \bm\sigma)$ (H2D-Gauss):
\begin{align}
	{\tt p}(\ln\bm\alpha \vert \bm x^*, \bm \lambda) & = \mathcal{N}(\ln\bm\alpha; \bm\mu, \bm\sigma^2) = \prod_{c = 1}^{K} \mathcal{N}(\ln\alpha_c; \mu_c, \sigma_c^2) \nonumber
\end{align} 
where $\bm\mu, \bm\sigma = \bm f(\bm x^*; \bm\lambda)$. By sampling from this Gaussian, one can obtain multiple Dirichlet distributions similar to, but cheaper than, an S2D ensemble. Clearly, the flexibility of such a model can easily be extended by allowing the model to predict a fully specified covariance; however, for reasons of computational tractability, only diagonal covariance models are used in this work. Note that a secondary head is required for such a model. In a similar fashion to previous approaches, this model can be trained using negative log-likelihood or by estimating a proxy teacher Gaussian and using KL-divergence. In this work we have adopted the proxy approach; see Appendix \ref{ssec:proxy} for details.
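
A minimal sketch of the sampling step described above, assuming the standard reparameterisation of a diagonal Gaussian. The number of samples $S$, the noise variable $\bm\epsilon^{(s)}$ and the superscript $(s)$ are notation introduced here for illustration:
\begin{align}
	\ln\bm\alpha^{(s)} & = \bm\mu + \bm\sigma \odot \bm\epsilon^{(s)}, \medspace\medspace \bm\epsilon^{(s)} \sim \mathcal{N}(\bm 0, \bm I), \medspace\medspace s = 1, \dots, S \nonumber \\
	{\tt P}(y = \omega_c \vert \bm x^*, \bm\lambda) & \approx \frac{1}{S} \sum_{s = 1}^{S} \frac{\alpha^{(s)}_c}{\alpha^{(s)}_0}, \medspace\medspace \alpha^{(s)}_0 = \sum_{k = 1}^{K} \alpha^{(s)}_k \nonumber
\end{align}
The resulting set $\{\bm\alpha^{(s)}\}_{s = 1}^{S}$ can then be treated in the same way as the Dirichlets of an S2D ensemble when estimating uncertainties.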

\section{Experimental Evaluation}
\label{sec:experiment}

\begin{table*}
	% \centering{}
	\caption{OOD detection results (LSUN resize) trained on C100. \textbf{Best} in column and \boldblue{best} overall.}
	\label{table: ood lsun r}
	\vspace{-2mm}
	\bgroup
	\def\arraystretch{1.00}
	\begin{adjustbox}{center}
		% \small
		\begin{tabular}{l|llll|llll}
			\toprule
			\multirow{2}{*}{{Model}} & \multicolumn{4}{c|}{OOD \%AUROC} & \multicolumn{4}{c}{OOD \%AUPR} \\
			& Conf. & TU & DU & KU & Conf. & TU & DU & KU \\
			\midrule
			{Individual} & 
			77.3 \tpm 0.9 & 79.8 \tpm 0.9 & & & 
			74.2 \tpm 1.1 & 76.9 \tpm 1.2 & & \\
			{S2D Individual} & 
			78.4 \tpm 2.3 & 80.7 \tpm 3.2 & 80.8 \tpm 3.1 & 80.0 \tpm 4.2 &
			75.4 \tpm 2.5 & 78.3 \tpm 3.5 & 79.5 \tpm 3.5 & 75.5 \tpm 3.8 \\
			\midrule
			MIMO & 
			78.5 \tpm 1.2 & 80.5 \tpm 1.4 & 80.6 \tpm 1.4 & 75.0 \tpm 2.8 & 
			75.0 \tpm 1.4 & 78.0 \tpm 1.6 & 78.1 \tpm 1.6 & 67.0 \tpm 3.5 \\ 
			S2D MIMO & 
			80.6 \tpm 4.1 & 81.4 \tpm 4.4 & 81.4 \tpm 4.4 & 81.3 \tpm 4.2 & 
			76.6 \tpm 5.2 & 78.8 \tpm 5.4 & 80.3 \tpm 5.4 & 77.7 \tpm 5.3 \\ 
			\midrule
			SWAG-Diag & 
			78.5 \tpm 1.0 & 80.5 \tpm 1.2 & 80.6 \tpm 1.3 & 75.2 \tpm 0.8 & 
			75.0 \tpm 1.4 & 78.1 \tpm 1.7 & 78.3 \tpm 1.8 & 67.1 \tpm 1.0 \\
			S2D SWAG-Diag & 
			78.7 \tpm 2.3 & 80.9 \tpm 2.8 & 81.1 \tpm 2.7 & 80.9 \tpm 3.8 & 
			75.4 \tpm 2.7 & 78.4 \tpm 3.6 & 79.7 \tpm 3.2 & 76.2 \tpm 4.1 \\
			\midrule
			{MC ensemble} & 
			76.6 \tpm 0.8 & 78.3 \tpm 0.8 & 78.9 \tpm 0.8 & 72.4 \tpm 1.2 & 
			72.2 \tpm 1.0 & 74.6 \tpm 1.6 & 75.6 \tpm 1.7 & 64.2 \tpm 2.0 \\
			{S2D MC ensemble} & 
			77.7 \tpm 0.9 &   79.8 \tpm 1.5 &   80.5 \tpm 1.1 &   78.1 \tpm 2.9 & 
			73.7 \tpm 1.0 &   76.1 \tpm 1.7 &   78.6 \tpm 1.3 &   72.0 \tpm 3.2\\
			\midrule
			{Deep ensemble} & 
			81.1 & 82.9 & 83.4 & 79.2 & 77.7 & 80.4 & 81.2 & 73.6\\
			{S2D Deep ensemble} & \textbf{82.4} & \textbf{84.8} & 85.0 & 83.5 & \textbf{79.5} & \textbf{82.5} & 83.9 & 78.7 \\
			\midrule
			{EnD} & 79.4 &  81.0 & & & 75.8 & 78.2 & & \\
			{H2D-Dir} & 80.3 & 83.2 & 83.4 & \boldblue{86.4} & 77.9 & 81.9 & 81.9 & \textbf{83.4} \\
			{H2D-Gauss} & 80.8 & 83.9 & \textbf{85.7} & 80.7 & 78.2 & 82.0 & \boldblue{85.8} & 76.0 \\
			\bottomrule
		\end{tabular}
	\end{adjustbox}
	\egroup
\end{table*}


This section investigates the self-distribution distillation approach for image classification. First, this approach is compared to standard trained models and established ensemble-based methods (deep ensembles and MC-dropout) as well as the diagonal version of SWAG (SWAG-Diag) and MIMO. Second, self-distribution distillation is combined with all of the above-mentioned approaches. Finally, knowledge distillation is compared to hierarchical distribution distillation of Dirichlet ensembles.

This comparison is based on two sets of experiments. The first set compares the performance of all baselines and proposed models in terms of image classification performance and calibration on CIFAR-100 \citep{cifar} without (C100) and with (C100+) a data augmentation scheme. The second set of experiments compares the out-of-distribution/domain (OOD) detection performance using various unseen datasets such as LSUN \citep{lsun}, Tiny ImageNet \citep{tim} and SVHN \citep{svhn}.

All experiments are based on training DenseNet-BC ($k = 12$) models with a depth of 100 \citep{densenet}. For ensemble generation methods, $M = 5$ models were sampled (in the case of MC-dropout ensembles and SWAG) or trained (in the case of deep ensembles). For MIMO we use two output heads ($M = 2$) due to limited capacity in the chosen model \citep{mimo}. Note that for this choice of model it was not possible to use ensemble distribution distillation, since DenseNet-BC models display high confidence on the training data of CIFAR-100, causing instability in distillation.
%
All single model training runs were repeated 5 times; mean $\pm$ 2 standard deviations are reported. The experimental setup and additional experiments are described in Appendices \ref{sec:config}--\ref{sec:wrn}.

\subsection{CIFAR-100 Classification Performance Experiments}

The first batch of experiments shows the classification performance using a range of metrics such as accuracy, negative log-likelihood (NLL) and expected calibration error (ECE), see Table \ref{tab:all-models-v1}. Perhaps the most noteworthy result is the improvement across all metrics and datasets of a self-distribution distilled model compared to its standard counterpart. The improvement is more than 2 standard deviations. A similar picture can be observed for the S2D versions of SWAG-Diag and MC-dropout which, without any notable increase in training or inference cost, improve upon their standard counterparts in all metrics. Regarding MIMO, a gain can still be observed when switching to the self-distribution distillation framework, but this boost is smaller. Finally, for the deep ensemble approach, the S2D version shows only a marginal improvement in accuracy and NLL but a notable increase in ECE. In fact, it is observed that ensembling standard models reduces ECE whereas ensembling S2D models increases it. This trend is associated with the level of calibration of the ensemble members. Unlike those of a standard deep ensemble, the members of the S2D counterpart are close to being calibrated, displaying little to no overconfidence. Ensembling these calibrated models leads to under-confident average predictions, hence the increased calibration error. Note that calibration error and negative log-likelihood can easily be reduced for in-domain data, post-training, by temperature scaling the predictions.
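
For reference, a commonly used binned estimator of ECE is sketched below. The number of bins $B$, the bins $\mathcal{B}_b$ (test examples grouped by predicted confidence) and the test set size $N$ are notation assumed here for illustration; the exact binning scheme is not restated in this section:
\begin{align}
	\text{ECE} = \sum_{b = 1}^{B} \frac{\vert \mathcal{B}_b \vert}{N} \Big\vert \text{acc}(\mathcal{B}_b) - \text{conf}(\mathcal{B}_b) \Big\vert \nonumber
\end{align}
where $\text{acc}(\mathcal{B}_b)$ and $\text{conf}(\mathcal{B}_b)$ are the average accuracy and average confidence within bin $b$. Under this definition, under-confident predictions correspond to $\text{conf}(\mathcal{B}_b) < \text{acc}(\mathcal{B}_b)$, which contributes to the error in the same way as over-confidence.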

The next set of comparisons concerns various distilled models, see the final block of Table \ref{tab:all-models-v1}. As expected, their performance lies between that of an individual model and that of the deep ensemble. While standard ensemble distillation (knowledge distillation) was found to consistently achieve better accuracy than other distillation methods, this success was highly dependent on the temperature used. A sub-optimal choice of temperature can drastically reduce performance. On the other hand, when distilling an S2D ensemble, no additional hyper-parameters are needed. We observe that while both H2D-Dir and H2D-Gauss obtained a higher NLL, they also achieved better calibration than their S2D ensemble teacher. 
%%
Lastly, one can observe that H2D-Dir and H2D-Gauss both outperform the standard SWAG-Diag and MC-dropout ensembles while using only a single forward pass. Although these distilled models involve an expensive training phase (a teacher ensemble is required), at test time they achieve much higher computational efficiency while still being able to estimate and decompose total uncertainty. 


\subsection{Out-of-distribution Detection Experiments}
\label{ssec:ood}

\begin{table*}
	% \centering{}
	\caption{OOD detection results (SVHN) trained on C100. \textbf{Best} in column and \boldblue{best} overall.}
	\label{table: ood svhn}
	\vspace{-2mm}
	\bgroup
	\def\arraystretch{1.04}
	\begin{adjustbox}{center}
		% \small
		\begin{tabular}{l|llll|llll}
			\toprule
			\multirow{2}{*}{{Model}} & \multicolumn{4}{c|}{OOD \%AUROC} & \multicolumn{4}{c}{OOD \%AUPR} \\
			& Conf. & TU & DU & KU & Conf. & TU & DU & KU \\
			\midrule
			{Individual} & 
			79.7 \tpm 5.6 & 81.8 \tpm 6.0 & & & 
			88.3 \tpm 3.6 & 89.6 \tpm 3.9 & & \\
			{S2D Individual} & 
			83.0 \tpm 2.9 & 86.0 \tpm 2.2 & 87.7 \tpm 2.2 & 81.2 \tpm 3.8 & 
			90.6 \tpm 1.7 & 92.0 \tpm 1.6 & 94.4 \tpm 1.1 & 86.1 \tpm 3.3 \\
			\midrule
			MIMO & 
			81.8 \tpm 4.1 & 84.3 \tpm 4.5 & 84.3 \tpm 4.5 & 80.9 \tpm 5.3 & 
			89.9 \tpm 2.5 & 91.4 \tpm 2.8 & 91.4 \tpm 2.8 & 88.2 \tpm 3.1 \\ 
			S2D MIMO & 
			84.1 \tpm 2.3 & 87.2 \tpm 2.1 & 87.4 \tpm 2.1 & 83.7 \tpm 1.8 & 
			89.6 \tpm 1.8 & 92.9 \tpm 1.6 & 93.2 \tpm 1.6 & 90.4 \tpm 1.3 \\ 
			\midrule
			SWAG-Diag & 
			81.4 \tpm 3.0 & 83.5 \tpm 3.6 & 83.5 \tpm 3.4 & 80.5 \tpm 4.9 & 
			89.2 \tpm 2.6 & 90.2 \tpm 3.2 & 90.2 \tpm 3.1 & 88.3 \tpm 3.6 \\ 
			S2D SWAG-Diag & 
			83.2 \tpm 2.7 & 86.3 \tpm 2.6 & 87.7 \tpm 2.5 & 82.7 \tpm 4.3 & 
			90.7 \tpm 1.7 & 92.3 \tpm 1.8 & 94.3 \tpm 1.4 & 87.3 \tpm 3.2 \\ 
			\midrule
			{MC ensemble} & 
			79.0 \tpm 4.3 & 81.6 \tpm 4.7 & 83.1 \tpm 4.6 & 68.3 \tpm 3.0 & 
			88.1 \tpm 2.8 & 89.3 \tpm 3.3 & 90.7 \tpm 3.1 & 77.4 \tpm 1.8 \\
			{S2D MC ensemble} & 
			82.3 \tpm 4.3 &   85.9 \tpm 4.1 &   88.4 \tpm 3.5 &   79.7 \tpm 6.1 &
			90.5 \tpm 2.6 &   92.1 \tpm 2.7 &   95.0 \tpm 1.7 &   85.4 \tpm 4.2\\
			\midrule
			{Deep ensemble} & 84.5 & 87.2 & 86.8 & 85.0 & 91.3 & 92.5 & 92.2 & \textbf{91.5} \\
			{S2D Deep ensemble} & \textbf{86.5} & \textbf{89.9} & \boldblue{91.7} & 85.1 & \textbf{92.6} & \textbf{94.1} & \boldblue{96.2} & 88.4\\
			\midrule
			{EnD} & 78.0 & 79.8 & & & 87.0 & 87.9 & & \\
			{H2D-Dir} & 84.6 & 88.4 & 88.5 & \textbf{87.6} & 91.7 & 93.6 & 91.7 & 90.6 \\
			{H2D-Gauss} & 81.2 & 85.3 & 90.1 & 74.5 & 90.0 & 91.4 & 95.9 & 81.7 \\
			\bottomrule
		\end{tabular}
	\end{adjustbox}
	\egroup
\end{table*}

The second batch of experiments investigates the out-of-distribution detection performance of models. The goal is to differentiate between two types of data: negative in-distribution (ID, sampled from the same source as the training data) and positive out-of-distribution (OOD) data. 

In all experiments the models were trained on C100. The ID data was always the test set of C100 and the OOD data was the test set of LSUN, Tiny ImageNet (TIM) or SVHN. Both LSUN and TIM examples had to be resized or randomly cropped as preprocessing before being fed to the model. The detection was done using four uncertainty estimates: confidence, total uncertainty (TU), data or aleatoric uncertainty (DU) and knowledge or epistemic uncertainty (KU). Performance was measured using the threshold-independent AUROC \citep{auroc} and AUPR \citep{aupr} metrics. Due to limited space, some LSUN and TIM experiments have been moved to Appendix \ref{ssec:tim}.
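
To make the role of the uncertainty estimates in this detection task explicit, the following sketch assumes a scalar score $u(\bm x)$ that should be larger for OOD inputs (notation introduced here for illustration). For the confidence measure this is assumed to be the negated maximum class probability:
\begin{align}
	u_{\text{conf}}(\bm x^*) = - \max_{c} {\tt P}(y = \omega_c \vert \bm x^*, \mathcal{D}), \medspace\medspace \text{AUROC} = {\tt P}\big( u(\bm x_{\text{OOD}}) > u(\bm x_{\text{ID}}) \big) \nonumber
\end{align}
where $\bm x_{\text{OOD}}$ and $\bm x_{\text{ID}}$ are drawn at random from the OOD and ID test sets respectively and ties are ignored. Under this reading, a \%AUROC of 50 corresponds to chance-level detection.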

First, apart from confidence, where both MIMO and SWAG-Diag are insignificantly better, there is not a single case in Tables \ref{table: ood lsun r} and \ref{table: ood svhn} where an individual model, MIMO, SWAG-Diag or MC-dropout ensemble is able to outperform the detection performance of a single S2D model. When comparing to a deep ensemble, the S2D model is outperformed in many cases. The general trend is that the ensemble is able to output marginally higher quality confidence and total uncertainty estimates on most datasets, but that S2D sometimes outperforms the ensemble when using data uncertainty (as in Table \ref{table: ood svhn}). 

Interestingly, the MC ensemble seems to degrade the quality of confidence and total uncertainty when compared to its standard individual counterpart. However, since an MC-dropout ensemble can estimate data uncertainty, it is able to outperform the standard model overall. Similarly, the S2D MC ensemble generally has inferior detection performance compared to its single deterministic model equivalent. The only exception is in detecting SVHN, where the ensemble has marginally better data uncertainty estimates. SWAG-Diag and MIMO both gain from being cast into the self-distribution distillation framework, drastically increasing their detection performance without additional cost at inference. 

Although the S2D deep ensemble, when compared to its vanilla counterpart, was not able to show any noticeable accuracy boost (on CIFAR-100), it does outperform it in this detection task. The only case where the S2D ensemble was not able to outshine the vanilla ensemble is when both use knowledge uncertainty to detect SVHN examples under the AUPR metric. Generally, S2D based systems outperform their standard counterparts. 

Regarding distillation based approaches, it is observed that ensemble knowledge distillation, EnD, is able to outperform the standard model in all cases except SVHN detection, but in no case is it able to reach the performance of the deep ensemble from which it was distilled. On the other hand, both the H2D-Dir and H2D-Gauss models outperform the distilled model and are able to decompose predictive uncertainty. Specifically, we find that H2D-Dir is able to generate the highest quality knowledge uncertainty estimates in almost all cases, and is able to outperform its S2D ensemble teacher using this uncertainty. The H2D-Gauss model, however, was not able to achieve similarly high quality knowledge uncertainty. Instead, this model generally displayed the best performing data uncertainty estimates, outperforming the vanilla deep ensemble in all cases, and the S2D equivalent in all but SVHN detection. 


\section{Conclusion}

Uncertainty estimation within deep learning is becoming increasingly important, with deep ensembles being the standard for estimating various sources of uncertainty. However, ensembles suffer from significantly higher computational requirements. This work proposes \textit{self-distribution distillation} (S2D), a novel collection of approaches for directly training models able to estimate and decompose predictive uncertainty without explicitly training an ensemble, and which can seamlessly be combined with other approaches. Additionally, if one is not resource-restricted during the training phase, a novel approach, \textit{hierarchical distribution distillation} (H2D), is described for transferring/distilling the knowledge of S2D style ensembles into a single flexible and robust student model. 
%
It is shown that S2D models are able to outperform standard models and rival MC ensembles on the CIFAR-100 test set. Additionally, S2D is able to produce higher quality uncertainty estimates compared to standard models and MC ensembles and, in most cases, is better able to detect out-of-distribution images from the LSUN, SVHN and TIM datasets. Combining S2D with other promising approaches such as MIMO and SWAG also yields additional gains in accuracy and detection performance. S2D is also able to rival the deep ensemble in certain cases even though it only requires a single forward pass. Furthermore, S2D deep ensembles and H2D derived student models are shown to notably outperform the deep ensemble in almost all detection problems. 
%
These promising results show that the efficient self-distribution and novel hierarchical distribution distillation approaches have the potential to train robust uncertainty estimating models able to outperform deep ensembles.
%
Future work should further investigate self-distribution distillation in other domains such as natural language processing and speech recognition. More efficient uncertainty estimation is especially useful in these areas as they often utilise large-scale models. Furthermore, one could also analyse variations of S2D such as utilising less weight sharing, generating more diverse teacher predictions or changing the student modelling choices. 



\clearpage
\bibliography{uai2022-conference}


\clearpage
\appendix
\onecolumn

% Experiments that could be useful:
% Add WideResNet experiments if needed
% Add CIFAR-10 experiments if needed
% Analyse correlation between uncertainty metrics of various models (see En2D paper)
% Analyse correlation between ensemble size and performance of uncertainty
% Use gaussian self-distribution distillation
% Use different stochastic regularisation techniques in S2D models
% Use temperature scaling in H2D-Dir


\section{Experimental Configuration}
\label{sec:config}
% Detail information about each dataset
% Detail information about each individual model
% Detail information about distillation
% Detail information about hierarchical distribution distillation

\begin{table}[h!]
	\centering{}
	\begin{center}
		\caption{Description of datasets used in training and evaluating models.} 
		\vspace{-2mm}
		\def\arraystretch{1.08}
		\begin{tabular}{l|c|c|c}
			\toprule
			Dataset & Train & Test & Classes \\
			\midrule
			CIFAR-100 & 50000 & 10000 & 100 \\
			LSUN & - & 10000 & 10 \\
			SVHN & - & 26032 & 10 \\
			Tiny ImageNet & - & 10000 & 200 \\
			\bottomrule
		\end{tabular}
		\label{tab:datasets}
		\par\end{center}
\end{table}

All models were trained on the CIFAR-100 dataset, with and without data augmentation. The augmentation scheme involves randomly mirroring and shifting images following \citet{daug-1, daug-2}. The remaining datasets were used as out-of-distribution samples in the detection task. 

All individual models and ensemble members were based on the DenseNet-BC ($k = 12$, 100 layers) architecture and trained according to \citet{densenet}. SWAG-Diag was obtained by checkpointing the weights of the last 20 epochs with a reduced learning rate of $\eta = 1.0 \times 10^{-4}$. MIMO with two output heads was trained using the same setup as for the standard model. To keep training costs comparable to (S2D) individual models, no batch or input repetition was used \citep{mimo}. Similarly, all self-distribution distilled equivalents were trained with identical training recipes with the addition of a student loss ($\mu = 1.28 \times 10^{-4}$).

Regarding distillation based models, the EnD baseline was trained with a negative log-likelihood criterion using the average temperature-scaled prediction of the teacher ensemble, with $T \in \{1.0, 2.0, 3.0, 4.0, 5.0\}$. For the hierarchical distribution distillation approaches, the students were first initialised with the weights of an S2D model trained for 150 epochs, for increased stability. Thereafter, each student was trained using the appropriate H2D criterion with a significantly reduced learning rate. H2D-Dir was trained using $\eta = 5.0 \times 10^{-5}$ for an additional 150 epochs. H2D-Gauss required an initial learning rate of $\eta = 5.0 \times 10^{-3}$ which was reduced by a factor of 2 after 75 and 150 epochs. It was trained for 170 epochs. Additionally, uncertainties were computed by generating 50 samples from each Gaussian prediction, since this modelling choice does not result in closed-form expressions.

\subsection{Proxy Target Training}
\label{ssec:proxy}
% Describe how proxy targets are estimated

Since the use of negative log-likelihood can be unstable in training S2D and distilling H2D models, we utilise proxy targets and KL-divergence. It has already been mentioned that the proxy target in S2D follows:
\begin{align}
\tilde{\bm\alpha} = \argmax_{\hat{\bm\alpha}} \sum_{m} \ln{\tt Dir}(\bm\pi^{(m)}; \hat{\bm\alpha}), \medspace\medspace \bm\pi^{(m)} = {\tt Softmax}(\bm z^{(m)}, T)
\end{align}
Each categorical prediction is temperature scaled, with $T = 1.5$, to mitigate overconfident predictions. While H2D-Dir does not require any proxy targets, the Gaussian equivalent does. The proxy diagonal Gaussian, estimated according to maximum log-likelihood, has a closed-form expression:
\begin{align}
\tilde{\bm\mu} = \frac{1}{M} \sum_{m = 1}^{M} \ln\bm\alpha^{(m)}, \medspace\medspace \tilde{\bm\sigma}^2 = \frac{1}{M} \sum_{m = 1}^{M} (\ln\bm\alpha^{(m)} - \tilde{\bm\mu})^2
\end{align}
where $\bm v^2 = \bm v \odot \bm v$ represents an element-wise multiplication. This is then used in a KL-divergence based loss, training the student with prediction $\bm\mu, \bm\sigma$ according to:
\begin{align}
\YKL[\Big]{\mathcal{N}(\bm z; \tilde{\bm\mu}, \tilde{\bm\sigma}^2)}{\mathcal{N}(\bm z; \bm\mu, \bm\sigma^2)} = \sum_{c = 1}^{K} \left[ \ln\Big(\frac{\sigma_c}{\tilde{\sigma}_c}\Big) + \frac{\tilde{\sigma}_c^2 + (\mu_c - \tilde{\mu}_c)^2}{2 \sigma_c^2} - \frac{1}{2} \right]
\end{align}
Note, however, that the proxy targets are detached from gradient backpropagation. This is to simulate typical teacher-student knowledge transfer where teacher weights are kept fixed during student training.
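
Two short clarifications may help when reading the expressions in this subsection. First, ${\tt Softmax}(\bm z^{(m)}, T)$ is assumed here to denote the usual temperature-scaled softmax over the logits $\bm z^{(m)}$:
\begin{align}
\pi^{(m)}_c = \frac{\exp\big(z^{(m)}_c / T\big)}{\sum_{k = 1}^{K} \exp\big(z^{(m)}_k / T\big)} \nonumber
\end{align}
Second, as a sanity check on the Gaussian KL-divergence above, a student that matches the proxy target exactly, $\mu_c = \tilde{\mu}_c$ and $\sigma_c = \tilde{\sigma}_c$, gives for every class $c$:
\begin{align}
\ln\Big(\frac{\tilde{\sigma}_c}{\tilde{\sigma}_c}\Big) + \frac{\tilde{\sigma}_c^2 + 0}{2 \tilde{\sigma}_c^2} - \frac{1}{2} = 0 + \frac{1}{2} - \frac{1}{2} = 0 \nonumber
\end{align}
so the loss attains its minimum value of zero, as expected of a KL-divergence.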

\newpage
\section{Out-of-distribution Detection}
% Describe missing experiments 
% Describe that ensemble size is on the smaller side so additional experiments will be run

This section covers the remaining out-of-distribution detection experiments. First, we cover the LSUN and Tiny ImageNet detection problems for all models considered in Section \ref{ssec:ood}. Thereafter, additional experiments are run on ensembles of various sizes. This is to investigate whether the low quality of knowledge uncertainty estimates is caused by a limited number of ensemble members.

\subsection{Tiny ImageNet Experiments}
\label{ssec:tim}
% Describe additional tim experiments similar to prior results

Similar to the results in Section \ref{ssec:ood}, the S2D Deep ensemble and H2D-Gauss outperformed all other models, see Tables \ref{table: ood tim r} and \ref{table: ood tim c}. The only exception is the use of confidence on resized TIM with the AUROC metric, where the Deep ensemble marginally outperforms the S2D equivalent. However, unlike previous results, knowledge uncertainty seems to perform on par with or outperform confidence. The one exception is the MC ensemble.


\begin{table}[h!]
	% \centering{}
	\caption{OOD detection results (LSUN random crop) trained on C100. \textbf{Best} in column and \boldblue{best} overall.} 
	\label{table: ood lsun c}
	\vspace{-2mm}
	\bgroup
	\def\arraystretch{1.12}
	\begin{adjustbox}{center}
		\small
		\begin{tabular}{l|llll|llll}
			\toprule
			\multirow{2}{*}{\textbf{Model}} & \multicolumn{4}{c|}{OOD \%AUROC} & \multicolumn{4}{c}{OOD \%AUPR} \\
			& Conf. & TU & DU & KU & Conf. & TU & DU & KU \\
			\midrule
			{Individual} &
			83.2 \tpm 2.1 & 85.7 \tpm 4.5 & & &
			79.4 \tpm 5.6 & 83.0 \tpm 5.9 & & \\
			{S2D Individual} & 
			85.4 \tpm 4.5 & 88.9 \tpm 4.1 & 90.3 \tpm 4.0 & 84.1 \tpm 4.8 & 
			81.9 \tpm 6.2 & 86.6 \tpm 5.7 & 90.3 \tpm 5.0 & 76.0 \tpm 5.1 \\
			\midrule
			MIMO & 
			83.3 \tpm 3.9 & 86.2 \tpm 4.2 & 86.3 \tpm 4.3 & 80.9 \tpm 1.6 & 
			79.6 \tpm 6.6 & 83.8 \tpm 6.8 & 83.8 \tpm 6.9 & 72.4 \tpm 3.7 \\ 
			S2D MIMO & 
			85.8 \tpm 2.5 & 89.5 \tpm 2.8 & 90.7 \tpm 2.8 & 85.5 \tpm 2.8 & 
			78.0 \tpm 3.4 & 84.8 \tpm 3.5 & 89.4 \tpm 3.4 & 75.2 \tpm 3.3 \\ 
			\midrule
			SWAG-Diag & 
			84.3 \tpm 2.8 & 87.1 \tpm 3.1 & 87.1 \tpm 3.1 & 80.8 \tpm 7.2 &
			80.8 \tpm 4.0 & 84.5 \tpm 3.8 & 84.6 \tpm 3.8 & 73.4 \tpm 14.2 \\
			S2D SWAG-Diag & 
			85.6 \tpm 2.7 & 89.1 \tpm 2.5 & 90.4 \tpm 2.5 & 85.3 \tpm 3.0 &
			81.8 \tpm 4.0 & 86.5 \tpm 3.6 & 90.2 \tpm 3.4 & 76.4 \tpm 3.6 \\
			\midrule
			{MC ensemble} & 
			81.0 \tpm 3.5 & 84.4 \tpm 4.0 & 86.4 \tpm 3.8 & 63.0 \tpm 4.0 & 
			77.0 \tpm 3.6 & 81.7 \tpm 4.0 & 84.9 \tpm 4.0 & 53.1 \tpm 3.1 \\
			{S2D MC ensemble} & 
			83.3 \tpm 2.3 &   86.9 \tpm 3.2 &   90.0 \tpm 2.5 &   77.7 \tpm 5.3 &
			79.3 \tpm 3.2 &   83.8 \tpm 4.3 &   90.1 \tpm 3.2 &   69.8 \tpm 4.8\\
			\midrule
			{Deep ensemble} & 85.9 & 89.1 & 90.9 & 80.4 & 82.0 & 86.3 & 89.1 & 72.5 \\
			{S2D Deep ensemble} & 86.8 & 90.5 & 93.7 & 81.5 &  \textbf{83.0} & 87.9 & 93.9 & 73.4 \\
			\midrule
			{EnD} & 84.7 & 87.4 & & & 81.1 & 84.9 & & \\
			{H2D-Dir} & 85.3 & 88.9 & 88.8 & \textbf{91.7} & 82.5 & 87.4 & 87.6 & \textbf{87.1}\\
			{H2D-Gauss} & \textbf{86.9} & \textbf{90.6} & \boldblue{95.1} & 76.0 & 82.9 & \textbf{88.0} & \boldblue{95.7} & 67.0 \\
			\bottomrule
		\end{tabular}
	\end{adjustbox}
	\egroup
\end{table}

\begin{table}[h!]
	% \centering{}
	\caption{OOD detection results (TIM resize) trained on C100. \textbf{Best} in column and \boldblue{best} overall.}
	\label{table: ood tim r}
	\vspace{-2mm}
	\bgroup
	\def\arraystretch{1.12}
	\begin{adjustbox}{center}
		\small
		\begin{tabular}{l|llll|llll}			
			\toprule
			\multirow{2}{*}{{Model}} & \multicolumn{4}{c|}{OOD \%AUROC} & \multicolumn{4}{c}{OOD \%AUPR} \\
			& Conf. & TU & DU & KU & Conf. & TU & DU & KU \\
			\midrule
			{Individual} & 
			77.6 \tpm 0.7 & 79.5 \tpm 0.7 & & &
			74.2 \tpm 0.7 & 77.1 \tpm 0.9 & & \\
			{S2D Individual} & 
			78.0 \tpm 0.8 & 80.1 \tpm 0.7 & 79.6 \tpm 0.8 & 78.1 \tpm 0.4 & 
			75.3 \tpm 0.9 & 77.7 \tpm 0.9 & 76.6 \tpm 1.2 & 76.3 \tpm 0.5 \\
			\midrule
			MIMO & 
			78.1 \tpm 0.4 & 79.9 \tpm 0.7 & 79.9 \tpm 0.8 & 76.3 \tpm 1.5 & 
			74.6 \tpm 1.0 & 77.3 \tpm 1.3 & 77.4 \tpm 1.3 & 69.6 \tpm 2.0 \\ 
			S2D MIMO & 
			80.1 \tpm 1.2 & 80.7 \tpm 1.2 & 80.7 \tpm 1.2 & 80.4 \tpm 1.2 & 
			77.3 \tpm 1.6 & 77.8 \tpm 1.6 & 77.7 \tpm 1.5 & 77.5 \tpm 1.6 \\ 
			\midrule
			SWAG-Diag & 
			77.7 \tpm 0.7 & 79.6 \tpm 0.6 & 79.6 \tpm 0.6 & 76.4 \tpm 0.7 & 
			74.2 \tpm 0.8 & 77.0 \tpm 0.8 & 77.1 \tpm 0.8 & 70.0 \tpm 0.7 \\
			S2D SWAG-Diag &
			78.6 \tpm 0.7 & 80.5 \tpm 0.6 & 80.1 \tpm 0.7 & 79.2 \tpm 0.5 & 
			75.6 \tpm 0.9 & 78.1 \tpm 1.1 & 77.1 \tpm 1.0 & 76.5 \tpm 0.9 \\
			\midrule
			{MC ensemble} & 
			78.5 \tpm 0.5 & 80.6 \tpm 0.3 & 80.8 \tpm 0.4 & 76.6 \tpm 0.6 & 
			75.2 \tpm 0.5 & 78.1 \tpm 0.6 & 78.4 \tpm 0.5 & 70.9 \tpm 1.1 \\
			{S2D MC ensemble} & 
			79.3 \tpm 0.5 &   81.1 \tpm 0.5 &   81.1 \tpm 0.5 &   80.4 \tpm 0.6 &
			76.4 \tpm 0.7 &   78.5 \tpm 0.8 &   78.1 \tpm 1.0 &   77.1 \tpm 0.7 \\
			\midrule
			{Deep ensemble} & 
			\textbf{81.7} & 83.6 & 83.5 & 81.0 & 78.9 & 81.6 & 81.5 & 76.6 \\
			{S2D Deep ensemble} & 
			81.5 & \boldblue{84.2} & 82.8 & 82.8 & \textbf{79.1} & \textbf{82.0} & 79.9 & 80.0 \\
			\midrule
			{EnD} & 78.7 & 80.4 & & & 75.4 & 78.0 & & \\
			{H2D-Dir} & 77.3 & 79.8 & 79.6 & 81.6 & 74.5 & 77.9 & 77.7 & 79.2 \\
			{H2D-Gauss} & 80.5 & 82.6 & \textbf{83.7} & \textbf{82.8} & 78.8 & 81.4 & \boldblue{82.5} & \textbf{80.1} \\
			\bottomrule
		\end{tabular}
	\end{adjustbox}
	\egroup
\end{table}

\begin{table}[h!]
	% \centering{}
	\caption{OOD detection results (TIM random crop) trained on C100. \textbf{Best} in column and \boldblue{best} overall.} 
	\label{table: ood tim c}
	\vspace{-2mm}
	\bgroup
	\def\arraystretch{1.12}
	\begin{adjustbox}{center}
		\small
		\begin{tabular}{l|llll|llll}
			\toprule
			\multirow{2}{*}{\textbf{Model}} & \multicolumn{4}{c|}{OOD \%AUROC} & \multicolumn{4}{c}{OOD \%AUPR} \\
			& Conf. & TU & DU & KU & Conf. & TU & DU & KU \\
			\midrule
			{Individual} &
			76.7 \tpm 4.1 & 79.2 \tpm 4.2 & & &
			74.7 \tpm 3.6 & 78.5 \tpm 3.8 & & \\
			{S2D Individual} & 
			80.2 \tpm 5.9 & 85.4 \tpm 6.2 & 84.5 \tpm 5.9 & 86.4 \tpm 6.3 &
			79.3 \tpm 6.3 & 83.3 \tpm 6.7 & 81.9 \tpm 6.6 & 83.1 \tpm 6.7 \\
			\midrule
			MIMO & 
			79.4 \tpm 4.8 & 81.9 \tpm 5.3 & 81.9 \tpm 5.3 & 79.8 \tpm 4.6 & 
			77.1 \tpm 4.8 & 80.9 \tpm 5.2 & 80.8 \tpm 5.3 & 74.9 \tpm 8.1 \\ 
			S2D MIMO & 
			80.3 \tpm 8.6 & 86.5 \tpm 8.5 & 86.5 \tpm 8.5 & 86.9 \tpm 8.6 & 
			80.0 \tpm 6.5 & 82.9 \tpm 6.4 & 83.0 \tpm 6.4 & 84.9 \tpm 6.5 \\ 
			\midrule
			SWAG-Diag & 
			78.4 \tpm 3.5 & 80.9 \tpm 3.7 & 80.9 \tpm 4.0 & 78.6 \tpm 2.0 & 
			76.0 \tpm 3.3 & 79.8 \tpm 3.4 & 79.7 \tpm 3.7 & 73.7 \tpm 3.5 \\ 
			S2D SWAG-Diag & 
			80.5 \tpm 6.0 & 84.8 \tpm 6.5 & 83.8 \tpm 6.3 & 86.6 \tpm 6.6 & 
			79.4 \tpm 5.5 & 83.4 \tpm 6.1 & 81.8 \tpm 6.2 & 83.4 \tpm 6.0 \\ 
			\midrule
			{MC ensemble} & 
			75.8 \tpm 4.5 & 78.8 \tpm 4.8 & 79.7 \tpm 4.9 & 69.3 \tpm 3.7 & 
			74.3 \tpm 4.0 & 78.5 \tpm 4.3 & 80.0 \tpm 4.3 & 60.8 \tpm 3.7 \\
			{S2D MC ensemble} & 
			78.8 \tpm 6.3 & 82.1 \tpm 6.4 & 82.6 \tpm 6.5 & 82.0 \tpm 6.1 &
			77.1 \tpm 5.2 & 81.1 \tpm 5.1 & 81.8 \tpm 5.1 & 79.8 \tpm 4.9 \\ 
			\midrule
			{Deep ensemble} & 80.9 & 84.2 & 83.5 & 82.3 & 79.3 & 83.9 & 83.2 & 79.8 \\
			{S2D Deep ensemble} & \textbf{84.8} & \textbf{88.5} & 86.4 & \boldblue{89.7} & \textbf{82.8} & \textbf{87.3} & 84.4 & \boldblue{87.7} \\
			\midrule
			{EnD} & 72.7 & 74.8 & & & 71.4 & 75.0 & & \\
			{H2D-Dir} & 74.7 & 78.2 & 77.9 & 84.2 & 73.2 & 77.7 & 77.5 & 81.7 \\
			{H2D-Gauss} & 83.2 &  88.0 & \textbf{88.0} & 88.5 & 81.0 & 86.0 & \textbf{87.2} & 84.1 \\
			\bottomrule
		\end{tabular}
	\end{adjustbox}
	\egroup
\end{table}

\newpage
\subsection{Ensemble Size Experiments}
% Introduce that the performance of ku does not match previous studies
% Describe new results

Knowledge uncertainty was found to have underwhelming performance (especially for MC and Deep ensembles) and did not show similar trends to prior work \citep{prior-networks, structured, en2d}. In an attempt to mitigate this, the ensemble size was increased, since a small number of models could lead to inaccurate measures of diversity and knowledge uncertainty. Results are compiled in Tables \ref{tab:ensembles-test-v1}--\ref{table: ood tim c ensemble}.

\begin{table}[b!]
	\centering{}
	\begin{minipage}[t]{1.0\textwidth}%
		\begin{center}
			\caption{Test performance of various ensembles and sizes ($\pm$ 2 std). All models are trained on C100.} 
			\vspace{-2mm}
			\bgroup
			\def\arraystretch{1.08}
			\small
			\begin{tabular}{c|c|l|l|l}
				\toprule
				\multirow{2}{*}{{Ensemble Type}} & 
				\multirow{2}{*}{{Ensemble Size (M)}} &
				\multirow{2}{*}{{Acc.}} & 
				\multirow{2}{*}{{NLL}} &
				\multirow{2}{*}{{\%ECE}} \\
				& & & & \\
				\midrule
				& 5   & 75.6 \tpm 0.9 & 0.94 \tpm 0.04 & 6.67 \tpm 1.18 \\
				MC & 10 & 75.8 \tpm 0.9 & 0.92 \tpm 0.04 & 6.11 \tpm 1.11 \\
				& 20 & 76.0 \tpm 1.0 & 0.91 \tpm 0.04 & 5.81 \tpm 1.12 \\
				\midrule
				& 5   & 79.3 & 0.76 & 1.44 \\
				Deep & 10  & 80.1 & 0.71 & 1.91 \\
				& 20 & 80.3 & 0.68 & 2.19 \\
				\bottomrule
			\end{tabular}
			\egroup
			\label{tab:ensembles-test-v1}
			\par\end{center}
	\end{minipage}
\end{table}

Performance on the CIFAR-100 test set is shown in Table \ref{tab:ensembles-test-v1}. Increasing the ensemble size leads to improved accuracy and lower negative log-likelihoods, as would be expected. The MC ensemble also becomes better calibrated. The Deep ensemble, on the other hand, has a calibration error that increases with the number of members. This is due to the ensemble prediction becoming under-confident when averaging over a large number of members. 

Out-of-distribution detection results on LSUN, SVHN and TIM are compiled in Tables \ref{table: ood lsun r ensemble}--\ref{table: ood tim c ensemble}. Although the MC ensemble enjoys improved accuracy when increased in size, it seems to remain relatively unaffected in terms of OOD detection using any uncertainty metric. Interestingly, in detecting LSUN using random crops, the performance of KU deteriorates notably. 
%
Overall, this suggests that MC ensembles are unable to exploit the information in additional ensemble member draws/samples for better uncertainty estimation.
%
The Deep ensemble generally improves with increasing size under every metric, although with diminishing returns (SVHN is an exception, where performance is largely flat or slightly worse). Where it does improve, all uncertainties benefit from the larger ensemble, not only knowledge uncertainty. It therefore seems that confidence, total and data uncertainty outperforming knowledge uncertainty is not caused by the ensemble size being limited to five members.

\begin{table}[h!]
	% \centering{}
	\caption{OOD detection results on LSUN (resize) for ensembles trained on C100.}
	\label{table: ood lsun r ensemble}
	\vspace{-2mm}
	\bgroup
	\def\arraystretch{1.08}
	\begin{adjustbox}{center}
		\small
		\begin{tabular}{cc|llll|llll}
			\toprule
			\multirow{2}{*}{{Type}} & \multirow{2}{*}{M} & \multicolumn{4}{c|}{OOD \%AUROC} & \multicolumn{4}{c}{OOD \%AUPR} \\
			& & Conf. & TU & DU & KU & Conf. & TU & DU & KU \\
			\midrule
			& 5 & 
			76.6 \tpm 0.8 & 78.3 \tpm 0.8 & 78.9 \tpm 0.8 & 72.4 \tpm 1.2 & 
			72.2 \tpm 1.0 & 74.6 \tpm 1.6 & 75.6 \tpm 1.7 & 64.2 \tpm 2.0 \\
			MC & 10 & 
			76.7 \tpm 0.6 & 78.3 \tpm 0.8 & 79.1 \tpm 0.9 & 72.6 \tpm 1.2 & 
			72.3 \tpm 1.1 & 74.6 \tpm 1.6 & 75.9 \tpm 1.7 & 64.3 \tpm 2.0 \\
			& 20 & 
			76.8 \tpm 0.7 & 78.4 \tpm 0.8 & 79.2 \tpm 0.8 & 72.7 \tpm 1.3 & 
			72.4 \tpm 1.2 & 74.6 \tpm 1.6 & 76.0 \tpm 1.7 & 64.3 \tpm 2.3 \\
			\midrule
			& 5 & 
			81.1 & 82.9 & 83.4 & 79.2 & 
			77.7 & 80.4 & 81.2 & 73.6 \\
			Deep & 10 & 
			82.0 & 83.9 & 84.8 & 80.3 & 
			79.1 & 81.8 & 83.4 & 74.9 \\
			& 20 &
			82.2 & 84.0 & 85.1 & 80.9 & 
			79.4 & 81.8 & 83.6 & 75.7  \\
			\bottomrule
		\end{tabular}
	\end{adjustbox}
	\egroup
\end{table}

\begin{table}[h!]
	% \centering{}
	\caption{OOD detection results on LSUN (random crop) for ensembles trained on C100.}
	\label{table: ood lsun c ensemble}
	\vspace{-2mm}
	\bgroup
	\def\arraystretch{1.08}
	\begin{adjustbox}{center}
		\small
		\begin{tabular}{cc|llll|llll}
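			\toprule
			\multirow{2}{*}{{Type}} & \multirow{2}{*}{M} & \multicolumn{4}{c|}{OOD \%AUROC} & \multicolumn{4}{c}{OOD \%AUPR} \\
			& & Conf. & TU & DU & KU & Conf. & TU & DU & KU \\
			\midrule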
			& 5 & 
			81.0 \tpm 3.5 & 84.4 \tpm 4.0 & 86.4 \tpm 3.8 & 63.0 \tpm 4.0 & 
			77.0 \tpm 3.6 & 81.7 \tpm 4.0 & 84.9 \tpm 4.0 & 53.1 \tpm 3.1 \\
			MC & 10 &
			81.0 \tpm 3.5 & 84.4 \tpm 3.9 & 86.7 \tpm 3.7 & 61.6 \tpm 3.9 & 
			77.0 \tpm 3.7 & 81.8 \tpm 4.0 & 85.4 \tpm 4.0 & 52.2 \tpm 3.0 \\
			& 20 &
			80.8 \tpm 3.7 & 84.1 \tpm 4.1 & 86.6 \tpm 3.9 & 60.9 \tpm 4.0 & 
			76.7 \tpm 3.9 & 81.3 \tpm 4.2 & 85.3 \tpm 4.2 & 51.7 \tpm 3.0 \\
			\midrule
			& 5 & 
			85.9 & 89.1 & 90.9 & 80.4 & 
			82.0 & 86.3 & 89.1 & 72.5 \\
			Deep & 10 &
			85.7 & 89.3 & 91.3 & 81.3 & 
			81.8 & 86.4 & 89.9 & 73.1 \\
			& 20 &
			86.2 & 89.8 & 92.2 & 82.0 & 
			82.1 & 86.8 & 91.0 & 73.1  \\
			\bottomrule
		\end{tabular}
	\end{adjustbox}
	\egroup
\end{table}

\begin{table}[h!]
	% \centering{}
	\caption{OOD detection results on SVHN for ensembles trained on C100.}
	\label{table: ood svhn ensemble}
	\vspace{-2mm}
	\bgroup
	\def\arraystretch{1.08}
	\begin{adjustbox}{center}
		\small
		\begin{tabular}{cc|llll|llll}
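			\toprule
			\multirow{2}{*}{{Type}} & \multirow{2}{*}{M} & \multicolumn{4}{c|}{OOD \%AUROC} & \multicolumn{4}{c}{OOD \%AUPR} \\
			& & Conf. & TU & DU & KU & Conf. & TU & DU & KU \\
			\midrule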
			& 5 &
			79.0 \tpm 4.3 & 81.6 \tpm 4.7 & 83.1 \tpm 4.6 & 68.3 \tpm 3.0 & 
			88.1 \tpm 2.8 & 89.3 \tpm 3.3 & 90.7 \tpm 3.1 & 77.4 \tpm 1.8 \\
			MC & 10 &
			78.9 \tpm 4.4 & 81.5 \tpm 4.7 & 83.3 \tpm 4.7 & 67.5 \tpm 3.1 & 
			88.0 \tpm 2.7 & 89.3 \tpm 3.3 & 90.9 \tpm 3.1 & 76.6 \tpm 2.0 \\
			& 20 &
			78.9 \tpm 4.4 & 81.5 \tpm 4.7 & 83.3 \tpm 4.7 & 67.1 \tpm 3.3 & 
			88.1 \tpm 2.7 & 89.2 \tpm 3.3 & 90.9 \tpm 3.1 & 76.3 \tpm 2.0 \\
			\midrule
			& 5 &
			84.5 & 87.2 & 86.8 & 85.0 & 
			91.3 & 92.5 & 92.2 & 91.5 \\
			Deep & 10 &
			84.1 & 87.0 & 87.5 & 83.9 & 
			91.2 & 92.4 & 93.1 & 90.3 \\
			& 20 &
			83.7 & 86.6 & 87.2 & 84.1 & 
			91.0 & 92.2 & 92.9 & 90.6  \\
			\bottomrule
		\end{tabular}
	\end{adjustbox}
	\egroup
\end{table}

\begin{table}[h!]
	% \centering{}
	\caption{OOD detection results on TIM (resize) for ensembles trained on C100.}
	\label{table: ood tim r ensemble}
	\vspace{-2mm}
	\bgroup
	\def\arraystretch{1.08}
	\begin{adjustbox}{center}
		\small
		\begin{tabular}{cc|llll|llll}
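			\toprule
			\multirow{2}{*}{{Type}} & \multirow{2}{*}{M} & \multicolumn{4}{c|}{OOD \%AUROC} & \multicolumn{4}{c}{OOD \%AUPR} \\
			& & Conf. & TU & DU & KU & Conf. & TU & DU & KU \\
			\midrule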
			& 5 &
			78.5 \tpm 0.5 & 80.6 \tpm 0.3 & 80.8 \tpm 0.4 & 76.6 \tpm 0.6 &
			75.2 \tpm 0.5 & 78.1 \tpm 0.6 & 78.4 \tpm 0.5 & 70.9 \tpm 1.1 \\
			MC & 10 &
			78.7 \tpm 0.6 & 80.8 \tpm 0.4 & 81.0 \tpm 0.5 & 77.4 \tpm 0.7 & 
			75.4 \tpm 0.6 & 78.4 \tpm 0.6 & 78.7 \tpm 0.5 & 72.2 \tpm 1.1 \\
			& 20 &
			78.8 \tpm 0.5 & 80.9 \tpm 0.4 & 81.2 \tpm 0.4 & 77.9 \tpm 0.7 & 
			75.6 \tpm 0.5 & 78.4 \tpm 0.4 & 78.8 \tpm 0.4 & 72.9 \tpm 1.4 \\
			\midrule
			& 5 &
			81.7 & 83.6 & 83.5 & 81.0 & 
			78.9 & 81.6 & 81.5 & 76.6 \\
			Deep & 10 &
			82.3 & 84.1 & 84.2 & 82.4 & 
			79.8 & 82.2 & 82.4 & 78.7 \\
			& 20 &
			82.6 & 84.4 & 84.5 & 83.0 & 
			80.1 & 82.4 & 82.8 & 79.6  \\
			\bottomrule
		\end{tabular}
	\end{adjustbox}
	\egroup
\end{table}

\begin{table}[h!]
	% \centering{}
	\caption{OOD detection results on TIM (random crop) for ensembles trained on C100.}
	\label{table: ood tim c ensemble}
	\vspace{-2mm}
	\bgroup
	\def\arraystretch{1.08}
	\begin{adjustbox}{center}
		\small
		\begin{tabular}{cc|llll|llll}
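			\toprule
			\multirow{2}{*}{{Type}} & \multirow{2}{*}{M} & \multicolumn{4}{c|}{OOD \%AUROC} & \multicolumn{4}{c}{OOD \%AUPR} \\
			& & Conf. & TU & DU & KU & Conf. & TU & DU & KU \\
			\midrule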
			& 5 &
			75.8 \tpm 4.5 & 78.8 \tpm 4.8 & 79.7 \tpm 4.9 & 69.3 \tpm 3.7 & 
			74.3 \tpm 4.0 & 78.5 \tpm 4.5 & 80.0 \tpm 4.3 & 60.8 \tpm 3.7 \\
			MC & 10 &
			75.7 \tpm 4.8 & 78.7 \tpm 5.1 & 79.7 \tpm 5.2 & 69.1 \tpm 3.9 & 
			74.2 \tpm 4.2 & 78.5 \tpm 4.5 & 80.2 \tpm 4.5 & 60.7 \tpm 3.8 \\
			& 20 &
			75.7 \tpm 4.7 & 78.6 \tpm 5.0 & 79.7 \tpm 5.2 & 69.0 \tpm 4.1 & 
			74.3 \tpm 4.1 & 78.4 \tpm 4.4 & 80.3 \tpm 4.3 & 60.6 \tpm 4.4 \\
			\midrule
			& 5 &
			80.9 & 84.2 & 83.5 & 82.3 & 
			79.3 & 83.9 & 83.2 & 79.8 \\
			Deep & 10 &
			82.8 & 86.5 & 85.7 & 85.5 & 
			81.0 & 85.8 & 85.0 & 83.7 \\
			& 20 &
			83.4 & 87.1 & 86.1 & 86.8 & 
			81.6 & 86.4 & 85.4 & 85.4  \\
			\bottomrule
		\end{tabular}
	\end{adjustbox}
	\egroup
\end{table}



\section{Behaviour of Uncertainties}
\label{sec:behaviour}
% Add plots comparing:
% 		Deep Ensemble
% 		S2D Individual
% 		S2D Deep Ensemble
% 		H2D-Gauss
% On both ID (C100) and OOD (SVHN) dataset
% Compare plots

This section investigates how the uncertainties produced by a vanilla Deep ensemble differ from those of systems derived from self-distribution distillation, and how well hierarchical distribution distillation captures the behaviour of its teacher. For simplicity, the comparison is made between the in-domain CIFAR-100 test set and a single out-of-domain test set, SVHN.

\begin{figure*}[h!]
	\centering
	\begin{subfigure}[b]{0.3\textwidth}
		\centering
		\includegraphics[width=\textwidth]{figures/ID+TU}
		\caption{ID: Total uncertainty}
		\label{fig:ID+TU}
	\end{subfigure}
	\hfill
	\begin{subfigure}[b]{0.3\textwidth}
		\centering
		\includegraphics[width=\textwidth]{figures/ID+DU}
		\caption{ID: Data uncertainty}
		\label{fig:ID+DU}
	\end{subfigure}
	\hfill
	\begin{subfigure}[b]{0.3\textwidth}
		\centering
		\includegraphics[width=\textwidth]{figures/ID+KU}
		\caption{ID: Knowledge uncertainty}
		\label{fig:ID+KU}
	\end{subfigure}
	% \hfill
	\begin{subfigure}[b]{0.3\textwidth}
		\centering
		\includegraphics[width=\textwidth]{figures/OOD+TU}
		\caption{OOD: Total uncertainty}
		\label{fig:OOD+TU}
	\end{subfigure}
	\hfill
	\begin{subfigure}[b]{0.3\textwidth}
		\centering
		\includegraphics[width=\textwidth]{figures/OOD+DU}
		\caption{OOD: Data uncertainty}
		\label{fig:OOD+DU}
	\end{subfigure}
	\hfill
	\begin{subfigure}[b]{0.3\textwidth}
		\centering
		\includegraphics[width=\textwidth]{figures/OOD+KU}
		\caption{OOD: Knowledge uncertainty}
		\label{fig:OOD+KU}
	\end{subfigure}
	\caption{Histograms of various uncertainties produced by Deep ensemble, S2D, S2D Deep ensemble and H2D-Gauss systems. Out-of-distribution data was generated from the SVHN test set.}
	\label{fig:uncs}
\end{figure*}

Figure \ref{fig:uncs} contrasts the various uncertainties on the CIFAR-100 (ID) and SVHN (OOD) test sets. The S2D systems output ID uncertainties in a consistent manner, even matching the conceptually different Deep ensemble. Note that S2D integrates temperature scaling (smoothing predictions) into the training of models, so the total and data uncertainties\footnote{Knowledge uncertainty does not necessarily increase with temperature.} estimated by these models will naturally have larger entropy than those of Deep ensembles. While the Deep ensemble is expected to behave differently on the SVHN OOD set, it is surprising how well H2D-Gauss aligns with its S2D Deep ensemble teacher. An individual S2D model also generates closely related total and data uncertainty estimates, but is significantly less consistent in its knowledge uncertainties. These results raise the question of whether a Gaussian student could capture the diversity in a vanilla Deep ensemble by modelling the logits, in a similar fashion to how H2D-Gauss models its teacher; this is a possible avenue for future work.


\section{Additional Experiments: WideResNet}
\label{sec:wrn}
% Repeat experiments from main section with WideResNets instead

Following the DenseNet-BC experiments in Section \ref{sec:experiment}, we repeated them with a different architecture. This section focuses on a significantly larger WideResNet \citep{wideresnet} model with a depth of 28 and a widening factor of 10. The standard and S2D models were both trained as described in \citet{wideresnet}, with the S2D-specific parameters unchanged from the earlier experiments. The only difference is that teacher predictions were generated using multiplicative Gaussian noise with a fixed standard deviation of $0.10$.
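
One plausible reading of this noise model, given only as a sketch, is the following. The symbol $\bm{h}$ stands for whichever quantity the noise is applied to, and both the unit mean and the element-wise product $\odot$ are assumptions made for illustration; only the multiplicative Gaussian form and the standard deviation of $0.10$ are taken from the description above:
\[
	\tilde{\bm{h}} = \bm{h} \odot \bm{\epsilon}, \qquad \bm{\epsilon} \sim \mathcal{N}\big(\bm{1}, \sigma^2 \bm{I}\big), \qquad \sigma = 0.10 .
\]
Under this reading, each stochastic forward pass draws a fresh $\bm{\epsilon}$ and yields one teacher prediction.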

The H2D-Gauss model was also trained in a different manner. First, it was initialised from an S2D model trained for 150 epochs. Thereafter it was trained for an additional 80 epochs with a starting learning rate of $\eta = 2 \times 10^{-3}$ which was reduced by a factor of 4 after 60 epochs. For this section, EnD and H2D-Dir were not investigated. 
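
Written out, and assuming the reduction is a single step applied once after epoch 60, the fine-tuning learning rate described above is
\[
	\eta_t =
	\begin{cases}
		2 \times 10^{-3}, & 1 \le t \le 60, \\
		\tfrac{1}{4} \cdot 2 \times 10^{-3} = 5 \times 10^{-4}, & 60 < t \le 80,
	\end{cases}
\]
where $t$ indexes the 80 additional epochs (an index introduced here for illustration).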

Table \ref{tab:all-wrn-models-v1} shows test set performance. Unlike in previous experiments, S2D was not able to outperform an individual model by more than two standard deviations, achieving an accuracy improvement of around one standard deviation. Interestingly, the MC approach has worse accuracy in both the standard and S2D cases; however, this could be due to the small number of drawn samples ($M = 5$).
\begin{table}[h!]
	\centering{}
	\begin{minipage}[t]{1.0\textwidth}%
		\begin{center}
			\caption{Test performance ($\pm$ 2 std).} 
			\vspace{-2mm}
			\bgroup
			\def\arraystretch{1.08}
			\small
			\begin{tabular}{l|l|l|l|l|l|l}
				\toprule 
				\multirow{2}{*}{{Dataset}} & 
				\multicolumn{3}{c|}{\multirow{2}{*}{{C100}}} & 
				\multicolumn{3}{c}{\multirow{2}{*}{{C100+}}} \\
				& \multicolumn{3}{c|}{} & \multicolumn{3}{c}{} \\
				\midrule
				\multirow{2}{*}{{Model}} & 
				\multirow{2}{*}{{Acc.}} & 
				\multirow{2}{*}{{NLL}} &
				\multirow{2}{*}{{\%ECE}} &
				\multirow{2}{*}{{Acc.}} & 
				\multirow{2}{*}{{NLL}} &
				\multirow{2}{*}{{\%ECE}} \\
				& & & & \\
				\midrule
				Individual & 73.9 \tpm 0.5 & 1.05 \tpm 0.02 & 5.26 \tpm 0.78  & 81.1 \tpm 0.3 & 0.76 \tpm 0.01 & 5.21 \tpm 0.44 \\
				S2D Individual & 74.2 \tpm 0.5 & 1.06 \tpm 0.05 & 5.48 \tpm 2.25  & 81.3 \tpm 0.3 & 0.74 \tpm 0.01 & 4.24 \tpm 0.74 \\
				\midrule
				MC ensemble & 73.6 \tpm 0.5 & 1.05 \tpm 0.03 & 4.70 \tpm 0.88  & 81.0 \tpm 0.5 & 0.74 \tpm 0.01 & 3.29 \tpm 0.36 \\
				S2D MC ensemble & 73.8 \tpm 0.4 & 1.03 \tpm 0.04 & 2.95 \tpm 1.01  & 81.0 \tpm 0.3 & 0.73 \tpm 0.01 & 1.99 \tpm 0.35 \\
				\midrule
				Deep ensemble & 77.1 & 0.88 & 5.08  & 83.4 & 0.63 & 2.27 \\
				S2D Deep ensemble & 77.9 & 0.86 & 4.52  & 83.6 & 0.63 & 1.84  \\
				\midrule
				% EnD & 75.01 & 1.12 & 9.69 & & & \\
				H2D-Gauss & 77.4 & 0.95 & 5.19 & 82.8 & 0.71 & 2.45 \\
				\bottomrule
			\end{tabular}
			\egroup
			\label{tab:all-wrn-models-v1}
			\par\end{center}
	\end{minipage}
\end{table}
Furthermore, both Deep ensembles significantly outperform their individual equivalents, with the S2D version matching or slightly improving on the standard one in every measured metric. The notable result in this table is the high performance of H2D-Gauss, which outperforms the Deep ensemble in accuracy on C100 and achieves near-ensemble performance on C100+.

In the OOD detection task (Tables \ref{table: wrn ood lsun r}--\ref{table: wrn ood tim (c)}), both versions of the MC ensemble struggle to outperform their individual counterparts. There is also a disparity in performance between resized and randomly cropped LSUN and TIM. With random crops, all S2D systems notably outperform their standard counterparts, and both S2D Individual and H2D-Gauss outperform the Deep ensemble under every uncertainty metric. For resized LSUN and TIM images and for SVHN, the differences in detection performance are smaller, but the S2D Deep ensemble remains the best model, with H2D-Gauss and the Deep ensemble performing similarly.
% 
\begin{table}[h!]
	% \centering{}
	\caption{LSUN (resize) OOD detection results. \textbf{Best} in column and \boldblue{best} overall.}
	\label{table: wrn ood lsun r}
	\vspace{-2mm}
	\bgroup
	\def\arraystretch{1.08}
	\begin{adjustbox}{center}
		\small
		\begin{tabular}{l|llll|llll}
			\toprule
			\multirow{2}{*}{{Model}} & \multicolumn{4}{c|}{OOD \%AUROC} & \multicolumn{4}{c}{OOD \%AUPR} \\
			& Conf. & TU & DU & KU & Conf. & TU & DU & KU \\
			\midrule
			{Individual} & 
			76.3 \tpm 0.5 & 76.7 \tpm 0.6 & & & 
			70.7 \tpm 0.8 & 71.1 \tpm 0.9 & & \\
			{S2D Individual} & 
			76.0 \tpm 1.1 & 76.5 \tpm 1.5 &  76.7 \tpm 1.4 & 75.7 \tpm 1.6 &
			71.4 \tpm 1.8 &  72.0 \tpm 2.7 &  72.8 \tpm 3.7 & 69.7 \tpm 2.0 \\
			\midrule
			{MC ensemble} & 
			75.8 \tpm 0.6 & 76.2 \tpm 0.7 & 76.4 \tpm 0.7 & 65.2 \tpm 1.7 & 
			70.3 \tpm 1.0 & 70.5 \tpm 1.1 & 70.8 \tpm 1.2 & 56.2 \tpm 1.5 \\
			{S2D MC ensemble} & 
			75.7 \tpm 1.0 & 76.4 \tpm 1.7 & 77.0 \tpm 1.6 & 75.2 \tpm 2.1 & 
			71.0 \tpm 1.6 & 71.6 \tpm 2.7 & 73.1 \tpm 3.8 & 69.6 \tpm 2.6 \\
			\midrule
			{Deep ensemble} & 77.6  & 78.0  & 78.4  & 68.0  & 72.3  & 72.6  & 73.1  & 58.8  \\
			{S2D Deep ensemble} & 
			\textbf{77.7} & \textbf{78.5} & \boldblue{79.3} & {76.8} & 
			\textbf{73.2} & \textbf{74.1} & \boldblue{75.9} & {71.3} \\
			\midrule
			{H2D-Gauss} & 
			77.1 & 77.2 & 77.8 & \textbf{77.5} &
			72.0 & 71.8 & 71.9 & \textbf{72.3} \\
			\bottomrule
		\end{tabular}
	\end{adjustbox}
	\egroup
\end{table}

\begin{table}[h!]
	% \centering{}
	\caption{LSUN (random crop) OOD detection results. \textbf{Best} in column and \boldblue{best} overall.}
	\label{table: wrn ood lsun c}
	\vspace{-2mm}
	\bgroup
	\def\arraystretch{1.08}
	\begin{adjustbox}{center}
		\small
		\begin{tabular}{l|llll|llll}
			\toprule
			\multirow{2}{*}{{Model}} & \multicolumn{4}{c|}{OOD \%AUROC} & \multicolumn{4}{c}{OOD \%AUPR} \\
			& Conf. & TU & DU & KU & Conf. & TU & DU & KU \\
			\midrule
			{Individual} & 
			72.4 \tpm 5.0 & 73.9 \tpm 5.4 & & &
			67.0 \tpm 2.9 & 68.7 \tpm 3.1 & & \\
			{S2D Individual} & 75.8 \tpm 3.4 & 77.6 \tpm 4.3 & 77.9 \tpm 4.7 & \textbf{76.5} \tpm 4.6 &
			70.5 \tpm 3.9 & 72.6 \tpm 4.9 & 74.4 \tpm 4.7 & \textbf{71.4} \tpm 5.5 \\
			\midrule
			{MC ensemble} & 
			68.9 \tpm 5.6 & 70.3 \tpm 6.0 & 70.9 \tpm 6.2 & 50.8 \tpm 3.7 &
			64.0 \tpm 3.0 & 65.2 \tpm 3.5 & 66.1 \tpm 3.6 & 45.7 \tpm 1.5 \\
			{S2D MC ensemble} & 
			72.7 \tpm 3.2 & 74.5 \tpm 4.1 & 75.9 \tpm 4.3 & 72.0 \tpm 4.4 & 
			67.7 \tpm 3.3 & 69.7 \tpm 4.6 & 73.4 \tpm 4.4 & 65.7 \tpm 5.0 \\
			\midrule
			{Deep ensemble} & 72.1  & 74.2  & 75.2  & 60.6  & 67.2  & 69.2  & 70.5  & 51.6  \\
			{S2D Deep ensemble} & 
			75.5 & \textbf{78.4} & \boldblue{80.0} & 75.4 &
			\textbf{70.7} & \textbf{73.9} & \boldblue{77.2} & 69.0 \\
			\midrule
			{H2D-Gauss} & 
			\textbf{76.0} & 77.6 & 77.8 & 76.4 & 
			69.6 & 71.5 & 74.1 & 70.9 \\
			\bottomrule
		\end{tabular}
	\end{adjustbox}
	\egroup
\end{table}

\begin{table}[h!]
	% \centering{}
	\caption{SVHN OOD detection results. \textbf{Best} in column and \boldblue{best} overall.}
	\label{table: wrn ood svhn}
	\vspace{-2mm}
	\bgroup
	\def\arraystretch{1.08}
	\begin{adjustbox}{center}
		\small
		\begin{tabular}{l|llll|llll}
			\toprule
			\multirow{2}{*}{{Model}} & \multicolumn{4}{c|}{OOD \%AUROC} & \multicolumn{4}{c}{OOD \%AUPR} \\
			& Conf. & TU & DU & KU & Conf. & TU & DU & KU \\
			\midrule
			{Individual} & 
			80.1 \tpm 4.6 & 81.6 \tpm 4.4 & & &
			88.3 \tpm 2.4 & 89.0 \tpm 2.3 & &  \\
			{S2D Individual} & 
			80.1 \tpm 4.4 & 81.6 \tpm 4.4 & 81.9 \tpm 4.8 & 81.4 \tpm 5.4 &
			88.6 \tpm 2.3 & 89.2 \tpm 2.5 & 90.1 \tpm 2.5 & 87.8 \tpm 4.1 \\
			\midrule
			{MC ensemble} & 
			77.6 \tpm 4.9 & 79.1 \tpm 4.5 & 79.7 \tpm 4.5 & 56.6 \tpm 2.5 & 
			86.9 \tpm 2.3 & 87.5 \tpm 2.2 & 88.0 \tpm 2.2 & 70.2 \tpm 1.2 \\
			{S2D MC ensemble} & 
			77.3 \tpm 4.7 & 79.0 \tpm 4.8 & 80.1 \tpm 4.6 & 77.3 \tpm 5.6 & 
			87.1 \tpm 2.5 & 87.7 \tpm 2.7 & 89.6 \tpm 2.5 & 85.7 \tpm 3.9 \\
			\midrule
			{Deep ensemble} & \textbf{81.5}  & 83.4  & 84.0  & 68.3  & 89.2  & 89.9  & 90.4  & 77.9  \\
			{S2D Deep ensemble} & 
			81.5 & \textbf{83.7} & \boldblue{84.6} & \textbf{81.8} & 
			\textbf{89.6} & \textbf{90.5} & \boldblue{92.0} & \textbf{88.1} \\
			\midrule
			{H2D-Gauss} & 
			81.5 & 82.1 & 83.2 & 80.6 & 
			88.6 & 88.4 & 90.5 & 87.1 \\
			\bottomrule
		\end{tabular}
	\end{adjustbox}
	\egroup
\end{table}

\begin{table}[h!]
	% \centering{}
	\caption{TIM (resize) OOD detection results. \textbf{Best} in column and \boldblue{best} overall.}
	\label{table: wrn ood tim (r)}
	\vspace{-2mm}
	\bgroup
	\def\arraystretch{1.08}
	\begin{adjustbox}{center}
		\small
		\begin{tabular}{l|llll|llll}
			\toprule
			\multirow{2}{*}{{Model}} & \multicolumn{4}{c|}{OOD \%AUROC} & \multicolumn{4}{c}{OOD \%AUPR} \\
			& Conf. & TU & DU & KU & Conf. & TU & DU & KU \\
			\midrule
			{Individual} & 
			79.7 \tpm 0.4 & 80.5 \tpm 0.4 & & &
			75.9 \tpm 0.5 & 76.9 \tpm 0.5 & & \\
			{S2D Individual} & 
			79.2 \tpm 0.6 &  80.0 \tpm 0.5 & 80.2 \tpm 0.3 & 80.2 \tpm 0.4 & 
			76.0 \tpm 1.0 & 77.1 \tpm 1.0 & 77.1 \tpm 0.7 & 76.7 \tpm 0.7 \\
			\midrule
			{MC ensemble} & 
			79.8 \tpm 0.4 & 80.6 \tpm 0.3 & 80.7 \tpm 0.4 & 68.3 \tpm 1.7 & 
			76.1 \tpm 0.7 & 77.0 \tpm 0.6 & 77.1 \tpm 0.6 & 59.5 \tpm 1.6 \\
			{S2D MC ensemble} & 
			79.4 \tpm 0.6 & 80.3 \tpm 0.7 & 80.2 \tpm 1.0 & 80.1 \tpm 0.7 & 
			75.9 \tpm 0.9 & 77.1 \tpm 1.0 & 77.2 \tpm 1.1 & 76.8 \tpm 0.6 \\
			\midrule
			{Deep ensemble} & 81.8  & 82.7  & 82.7  & 72.5  & 78.4  & 79.3  & 79.2  & 64.1  \\
			{S2D Deep ensemble} & 
			\textbf{81.9} & \textbf{82.9} & \boldblue{82.9} & \textbf{82.5} &
			\textbf{79.0} & \textbf{80.2} & \boldblue{80.2} & \textbf{79.6} \\
			\midrule
			{H2D-Gauss} & 
			80.9 & 81.4 & 81.4 & 81.5 &
			77.4 & 79.0 & 78.9 & 78.0 \\
			\bottomrule
		\end{tabular}
	\end{adjustbox}
	\egroup
\end{table}

\begin{table}[h!]
	% \centering{}
	\caption{TIM (random crop) OOD detection results. \textbf{Best} in column and \boldblue{best} overall.}
	\label{table: wrn ood tim (c)}
	\vspace{-2mm}
	\bgroup
	\def\arraystretch{1.08}
	\begin{adjustbox}{center}
		\small
		\begin{tabular}{l|llll|llll}
			\toprule
			\multirow{2}{*}{{Model}} & \multicolumn{4}{c|}{OOD \%AUROC} & \multicolumn{4}{c}{OOD \%AUPR} \\
			& Conf. & TU & DU & KU & Conf. & TU & DU & KU \\
			\midrule
			{Individual} & 
			71.2 \tpm 3.8 & 72.8 \tpm 4.0 & & &
			68.9 \tpm 3.5 & 70.9 \tpm 4.0 & & \\
			{S2D Individual} & 
			73.1 \tpm 3.0 & 74.9 \tpm 3.6 & 76.3 \tpm 3.9 & 75.9 \tpm 3.4 & 
			71.4 \tpm 1.7 & 73.7 \tpm 2.2 & 74.5 \tpm 2.4 & 73.4 \tpm 2.4 \\
			\midrule
			{MC ensemble} & 
			70.1 \tpm 3.5 & 71.8 \tpm 3.7 & 72.1 \tpm 3.7 & 57.1 \tpm 1.0 & 
			68.1 \tpm 3.6 & 70.2 \tpm 3.9 & 70.6 \tpm 3.9 & 50.4 \tpm 1.1 \\
			{S2D MC ensemble} & 
			71.7 \tpm 2.7 & 73.8 \tpm 3.2 & 74.2 \tpm 3.3 & 73.7 \tpm 3.1 &
			70.0 \tpm 1.5 & 72.6 \tpm 1.7 & 73.3 \tpm 1.8 & 71.9 \tpm 1.6 \\
			\midrule
			{Deep ensemble} & 72.2  & 74.5  & 74.7  & 65.2  & 70.3  & 72.9  & 73.0  & 58.1  \\
			{S2D Deep ensemble} & 
			74.3 & \textbf{77.0} & {77.3} & \textbf{77.1} & 
			\textbf{72.6} & \textbf{75.9} & \boldblue{76.2} & \textbf{75.5} \\
			\midrule
			{H2D-Gauss} & 
			\textbf{75.2} & 76.9 & \boldblue{77.3} & 76.4 & 
			72.0 & 74.0 & 74.5 & 73.5 \\
			\bottomrule
		\end{tabular}
	\end{adjustbox}
	\egroup
\end{table}







% \section{Additional Uncertainty Metrics}
% Describe that the chosen uncertainties are the typical ones
% Describe that there are more options depending on the type of model
% Describe that the hierarchy of distribution modelled by model decides flexibility in estimating wide variety of uncertainties

% \subsection{Derivations}
% Derive all uncertainties for TU, DU, KU

% \subsection{Out-of-distribution Detection Experiments}
% Run experiments with these uncertainties



\end{document}
