% \documentclass{uai2022} % for initial submission
\documentclass[accepted]{uai2022} % after acceptance, for a revised
                                    % version; also before submission to
                                    % see how the non-anonymous paper
                                    % would look like
%% There is a class option to choose the math font
% \documentclass[mathfont=ptmx]{uai2022} % ptmx math instead of Computer
                                         % Modern (has noticable issues)
% \documentclass[mathfont=newtx]{uai2022} % newtx fonts (improves upon
                                          % ptmx; less tested, no support)
% NOTE: Only keep *one* line above as appropriate, as it will be replaced
%       automatically for papers to be published. Do not make any other
%       change above this note for an accepted version.

%% Choose your variant of English; be consistent
\usepackage[american]{babel}
% \usepackage[british]{babel}

%% Some suggested packages, as needed:
\usepackage{natbib} % has a nice set of citation styles and commands
    \bibliographystyle{plainnat}
    \renewcommand{\bibsection}{\subsubsection*{References}}
\usepackage{mathtools} % amsmath with fixes and additions
\usepackage{siunitx} % for proper typesetting of numbers and units
\usepackage{booktabs} % commands to create good-looking tables
\usepackage{tikz} % nice language for creating drawings and diagrams

\usepackage{algorithm}
\usepackage{algorithmic}
\usepackage{amsmath}
\usepackage{amssymb}
\usepackage{subfigure}
\usepackage{bibentry}
% END REMOVE bibentry

\usepackage{xcolor}
\usepackage{colortbl}
\usepackage[switch]{lineno}

\usepackage{etoolbox}
\newrobustcmd\B{\DeclareFontSeriesDefault[rm]{bf}{b}\bfseries}   

%% Provided macros
% \smaller: Because the class footnote size is essentially LaTeX's \small,
%           redefining \footnotesize, we provide the original \footnotesize
%           using this macro.
%           (Use only sparingly, e.g., in drawings, as it is quite small.)


\title{Do Bayesian Variational Autoencoders Know What They Don't Know?}

% The standard author block has changed for UAI 2022 to provide
% more space for long author lists and allow for complex affiliations
%
% All author information is authomatically removed by the class for the
% anonymous submission version of your paper, so you can already add your
% information below.
%
% Add authors
\author{\href{mailto:<m.glazunov@tudelft.nl>?Subject=Do Bayesian VAEs Know What They Don't Know}{Misha~Glazunov}{}}
\author{Apostolis~Zarras}
% Add affiliations after the authors
\affil{%
Delft University of Technology, the Netherlands
}

  
\begin{document}
\maketitle

\begin{abstract}
Detecting Out-of-Distribution (OoD) inputs is of paramount importance for Deep Neural Networks. It has previously been shown that even Deep Generative Models, which allow estimating the density of the inputs, may not be reliable: they often make over-confident predictions for OoD inputs, assigning them a higher density than in-distribution data. This over-confidence of a single model can potentially be mitigated by Bayesian inference over the model parameters, which takes epistemic uncertainty into account. This paper investigates three approaches to Bayesian inference: stochastic gradient Markov chain Monte Carlo, Bayes by Backpropagation, and Stochastic Weight Averaging-Gaussian. The inference is performed over the weights of the deep neural networks that parameterize the likelihood of the Variational Autoencoder. We empirically evaluate the approaches against several benchmarks that are often used for OoD detection: the marginal likelihood estimated with a sampled model ensemble, the typicality test, the disagreement score, and the Watanabe-Akaike Information Criterion. Finally, we introduce two simple scores that demonstrate state-of-the-art performance.
\end{abstract}

\section{Introduction}\label{sec:intro}
\textit{Deep Neural Networks} (DNNs) are commonly trained by \textit{Maximum Likelihood Estimation} (MLE), i.e., by maximizing the likelihood $p(\mathcal{D}|\boldsymbol{\theta})$ of the training data $\mathcal{D}$ with respect to the parameters $\boldsymbol{\theta}$. There exist two main approaches to modeling with DNNs: \textit{discriminative} and \textit{generative}.

The discriminative approach implies parameterizing a conditional distribution over target values $y$: $p(y|\mathbf{x},\boldsymbol{\theta})$. Training identifies optimal parameters $\boldsymbol{\theta}^*$ with a stochastic first-order optimization algorithm. In classification tasks, the common choice for $p(y|\mathbf{x},\boldsymbol{\theta})$ is a categorical distribution; in regression, it is a Gaussian distribution (quite often with a constant variance). It has recently been discovered that such models tend to be over-confident in their predictions on Out-of-Distribution (OoD) inputs~\citep{nguyen2015deep, hendrycks2016baseline}. This may not be surprising, since MLE yields a point estimate and does not account for epistemic uncertainty. Given that in modern DNNs $|\boldsymbol{\theta}| \gg |\mathcal{D}|$, there may be many parameter settings $\boldsymbol{\theta}^*$ that explain $\mathcal{D}$ equally well.

Epistemic uncertainty can be estimated by inferring a posterior distribution: $p(\boldsymbol{\theta}|\mathcal{D})$ which can be done within the \emph{Bayesian} frame of reference. Several promising results were achieved with the discriminative DNNs for OoD detection utilizing \emph{Bayesian} inference over model parameters~\citep{blundell2015weight, chen2014stochastic, maddox2019simple}.

On the other hand, the generative approach learns an approximation of the true distribution of the training data: $p(\mathbf{x})$. This density is again parameterized by DNNs, hence the name \textit{Deep Generative Models} (DGMs). Since DGMs provide a mechanism to estimate the probability of a particular input, they should supposedly assign a low density to OoD inputs. However, recent research revealed that such estimates are error-prone, as DGMs often assign higher density values to OoD inputs than to \textit{In-Distribution} (ID) data~\citep{nalisnick2018deep}.

As was the case with the discriminative deep models, one may overcome this problem with a \emph{Bayesian} DGM that infers the posterior $p(\boldsymbol{\theta}|\mathcal{D})$ over the model parameters $\boldsymbol{\theta}$ given the training data $\mathcal{D}$. Such an approach yields an ensemble of approximations of the true data distribution, where each sample from the posterior $\boldsymbol{\theta} \sim p(\boldsymbol{\theta}|\mathcal{D})$ gives a separate instance of the model in the ensemble. By sampling from the posterior distribution, it is possible to estimate the density of an input instance $p(\mathbf{x})$ while taking epistemic uncertainty into account.

In this work, we implement several methods that are widely used for \emph{Bayesian} inference over DNN parameters, namely: \textit{Bayes by Backpropagation} (BBB)~\citep{blundell2015weight}, \textit{Stochastic Gradient Hamiltonian Monte Carlo} (SGHMC)~\citep{chen2014stochastic}, and \textit{Stochastic Weight Averaging-Gaussian} (SWAG)~\citep{maddox2019simple}. Until now, most of these methods have been applied only to discriminative supervised DNNs. It should be noted that even though the theoretical justification for \emph{Bayesian} \textit{Variational Autoencoders} (VAEs) was already present in the original paper~\citep{kingma2013auto}, %(see Appendix F),
there are still very few works addressing this point. In fact, to the best of our knowledge, there is only one paper dedicated to \emph{Bayesian} VAEs (BVAEs) and OoD detection, and it uses only one of the methods (i.e., SGHMC)~\citep{daxberger2019bayesian}. Our work is an attempt to close this gap between the discriminative and generative approaches based on DNNs. We transfer all of the mentioned methods to deep generative VAEs and test them against several benchmarks suggested for OoD detection on various image datasets. Finally, based on our experiments, we introduce two simple scores for OoD detection that surpass all baseline scores.

\noindent In summary, we make the following main contributions:

\begin{itemize}

\item We perform the first implementation of three different \emph{Bayesian} approaches for VAEs estimating epistemic uncertainty.

\item We benchmark the most frequently used scores for OoD detection, taking into consideration \emph{Bayesian} inference over the parameters of the likelihood of the VAE.
%
\item We suggest and apply two simple and efficient scores for the OoD detection that outperform baseline scores.
%
\item We empirically evaluate the suggested approach based on several datasets.\footnote{ The source code for the reproducibility of the results is available at \url{https://github.com/DigitalDigger/BayesianVAEsOoD}}

\end{itemize}


\section{Background}
\label{sec:background}

\subsection{Out-of-Distribution}

Deploying a successful model requires the system to detect input data that are statistically anomalous or significantly different from those used during training. This is especially important for DNNs since they tend to produce overconfident predictions for such OoD inputs~\citep{lee2018simple}. The lack of reliability of supervised discriminative models based on DNNs, when faced with OoD, was recently addressed by various methods~\citep{hendrycks2016baseline, hendrycks2018deep, liang2017enhancing}.

Unsupervised DGMs such as autoregressive models~\citep{oord2016conditional}, \textit{Generative Adversarial Networks} (GANs)~\citep{goodfellow2014generative}, flow-based models~\citep{dinh2016density, kingma2016improved, kingma2018glow}, and VAEs~\citep{kingma2013auto, rezende2014stochastic} provide the opportunity to learn the density of the input data.

We choose to apply VAE as a particular instance of DGM in our experiments for several reasons:

\begin{enumerate}
\item It allows obtaining an explicit value for the density, in contrast to GANs, which can only be sampled in a black-box manner.

\item It represents a model based on the latent variable, which seems like a reasonable assumption considering the complexity of the data's underlying density. The latent space has a much lower dimensionality in comparison with the dimensionality of the input. Such a bottleneck allows for learning the most relevant features. It distinguishes VAE from the flow-based models where all the transformations are invertible and represent a bijective mapping between the input and the latent space~\citep{dinh2016density, kingma2018glow, nielsen2020survae}.

%of the data which may provide the means for the explainability as it was %recently successfully demonstrated in the representation %learning~\cite{DBLP:conf/iclr/HigginsMPBGBML17, burgess2018understanding, %chen2018isolating}.
%
\item It allows the parametrization of all the \emph{Bayesian} inference constituents, including the posterior over latent variable and the likelihood of the data~\citep{kingma2013auto, rezende2014stochastic}. Such separation of the constituents makes it possible to work with the particular part, as in our case, with the decoder of the VAE for weight uncertainty estimation, distinguishing VAEs from the autoregressive models.
%
\end{enumerate}

%VAEs are based on the stochastic transformation with the bottlneck layer %that distinguishes them from the flow-based models %\cite{nielsen2020survae}. VAEs are also based on the latent variable that %distinguishes them from the auto-regressive flows %\cite{oord2016conditional}.

Furthermore, as shown by~\citet{nalisnick2019detecting}, different DGMs may not produce similar results, which suggests concentrating the analysis on a single type of model; in our case, this is the VAE.

However, it has recently been discovered that even DGMs that allow density estimation do not work as intended: they may return a higher $p(\mathbf{x})$ for input data from a different distribution than for ID data~\citep{nalisnick2018deep}.

There are several approaches to tackle this issue. One possible solution is to enhance the training dataset. \citet{hendrycks2018deep} suggested incorporating pre-selected anomalous examples via the so-called outlier exposure technique, which achieved promising results. \citet{ren2019likelihood} proposed a different method in which two DGMs are trained separately: one for the semantics of the images and another for the background of the same images. The background images are generated with substantial noise so that the model learns this background while discarding the image's semantics. Then, by calculating the likelihood ratios between the model that learned the semantics and the model that learned the background, it is possible to detect OoD inputs. However, both approaches rely on knowledge of either the outliers or the image's background, which cannot reasonably be covered for all possible inputs. Moreover, recent research reveals that the likelihood ratio method does not achieve satisfactory results when a \emph{Bayesian} VAE is applied~\citep{daxberger2019bayesian}. For these reasons, we do not consider methods of dataset enhancement in this work.

Another possible solution is to devise an alternative score for the OoD detection. In that vein, the \textit{Watanabe-Akaike Information Criterion} (WAIC) was successfully used by~\citet{choi2018waic}. Further, the disagreement score~\citep{daxberger2019bayesian} was suggested for the same purpose. This idea was motivated within the information-theoretic framework and was also based on the posterior estimation over the model parameters. The considered scores were calculated in both works based on the densities obtained from several models. In the former case, an ensemble was trained to calculate WAIC, while the latter case used the \emph{Bayesian} VAE. In addition, \citet{nalisnick2019detecting} introduced a typicality test of the input sequences under the conjecture that the inlier sequences should be members of the DGM's typical set.

We address the OoD problem in a similar way, i.e., we suggest and apply new simple scores that help to detect OoD inputs.


\subsection{Variational Autoencoder}
\label{sec:vae_section}

VAE represents a type of DGM that provides the possibility of density estimation of the input $\mathbf{x}$. The optimization objective of VAE is the \textit{Evidence Lower BOund} (ELBO), which allows joint optimization with respect to both variational parameters $\boldsymbol{\phi}$ of the encoder responsible for the variational approximation of the posterior $q_{\boldsymbol{\phi}}$ over the latent variable $\mathbf{z}$, and the generation parameters $\boldsymbol{\theta}$ of the decoder responsible for the parametrization of the likelihood of the input $p_{\boldsymbol{\theta}}(\mathbf{x}|\mathbf{z})$:
%
\begin{footnotesize}
\begin{equation}
\label{eq:elbo_z}
\mathcal{L}_{\boldsymbol{\theta}, \boldsymbol{\phi}}(\mathbf{x}) = \mathbb{E}_{q_{\boldsymbol{\phi}}(\mathbf{z}|\mathbf{x})} [\log p_{\boldsymbol{\theta}}(\mathbf{x},\mathbf{z}) - \log q_{\boldsymbol{\phi}}(\mathbf{z}|\mathbf{x})]
\end{equation}
\end{footnotesize}
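
Using $p_{\boldsymbol{\theta}}(\mathbf{x},\mathbf{z}) = p_{\boldsymbol{\theta}}(\mathbf{x}|\mathbf{z})p(\mathbf{z})$, the same objective can be written as a reconstruction term minus a regularizer, and its gap to the marginal log-likelihood is a KL-divergence to the true (intractable) posterior $p_{\boldsymbol{\theta}}(\mathbf{z}|\mathbf{x})$. This is the standard decomposition, restated here only for reference:
%
\begin{footnotesize}
\begin{equation}
\begin{split}
\mathcal{L}_{\boldsymbol{\theta}, \boldsymbol{\phi}}(\mathbf{x}) &= \mathbb{E}_{q_{\boldsymbol{\phi}}(\mathbf{z}|\mathbf{x})} [\log p_{\boldsymbol{\theta}}(\mathbf{x}|\mathbf{z})] - \mathrm{KL}\big(q_{\boldsymbol{\phi}}(\mathbf{z}|\mathbf{x}) \,\|\, p(\mathbf{z})\big) \\
&= \log p_{\boldsymbol{\theta}}(\mathbf{x}) - \mathrm{KL}\big(q_{\boldsymbol{\phi}}(\mathbf{z}|\mathbf{x}) \,\|\, p_{\boldsymbol{\theta}}(\mathbf{z}|\mathbf{x})\big)
\end{split}
\end{equation}
\end{footnotesize}
%
Since the last KL term is non-negative, the ELBO never exceeds $\log p_{\boldsymbol{\theta}}(\mathbf{x})$, and the bound is tight exactly when the variational posterior matches the true one.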

VAEs are trained in an unsupervised manner from data and are widely used for generative purposes.


\subsection{Estimation of the Marginal Likelihood}
\label{sec:marginal_likelihood}

Marginal likelihood can be computed in the following way:
%
\begin{footnotesize}
\begin{equation}
\label{integral_marginal_likelihood}
p_{\boldsymbol{\theta}}(\mathbf{x}) = \int_{\mathbf{z}} p_{\boldsymbol{\theta}}(\mathbf{x}|\mathbf{z})p(\mathbf{z})d\mathbf{z} \end{equation}
\end{footnotesize}

However, computing it exactly is difficult due to the integration over the whole $\mathbf{z}$-space. As suggested by~\citet{rezende2014stochastic}, once the VAE is trained, the marginal likelihood of an input under the generative model can be estimated using \emph{importance sampling} w.r.t.\ the approximate posterior, namely:
%
\begin{footnotesize}
\begin{equation}\label{eq_importance_sampling}
p_{\boldsymbol{\theta}}(\mathbf{x}) \simeq \frac{1}{N} \sum_{i=1}^{N} \frac{p_{\boldsymbol{\theta}}(\mathbf{x}, \mathbf{z}^{(i)})}{q_{\boldsymbol{\phi}}(\mathbf{z}^{(i)} | \mathbf{x})},\quad\textrm{where}\quad\mathbf{z}^{(i)} \sim q_{\boldsymbol{\phi}}(\mathbf{z} | \mathbf{x})
\end{equation}
\end{footnotesize}
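
In practice, the ratios in Equation~\ref{eq_importance_sampling} are products of many small densities and tend to underflow in floating point, so the estimate is usually evaluated in log-space. A minimal sketch follows, where $\mathbf{z}^{(i)}$ denotes the $i$-th sample from $q_{\boldsymbol{\phi}}(\mathbf{z}|\mathbf{x})$, and the log-weights $a_i$ and their maximum $a_{\max}$ are notation introduced here for illustration:
%
\begin{footnotesize}
\begin{equation}
\begin{split}
\log p_{\boldsymbol{\theta}}(\mathbf{x}) &\simeq a_{\max} + \log \sum_{i=1}^{N} \exp(a_i - a_{\max}) - \log N, \\
a_i &= \log p_{\boldsymbol{\theta}}(\mathbf{x}, \mathbf{z}^{(i)}) - \log q_{\boldsymbol{\phi}}(\mathbf{z}^{(i)} | \mathbf{x}), \quad a_{\max} = \max_i a_i
\end{split}
\end{equation}
\end{footnotesize}
%
Subtracting $a_{\max}$ leaves the value unchanged and guarantees that the largest exponent is zero, so at least one term of the sum equals one.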

As discovered by~\citet{nalisnick2018deep}, we cannot rely directly on the marginal likelihood estimates produced by a single DGM. The same behavior is well known for discriminative models based on DNNs, so it should not be surprising for DGMs either: they are also based on DNNs, and their optimal parameters $\boldsymbol{\theta}^*$ are also obtained by MLE of $p(\mathcal{D}|\boldsymbol{\theta})$, which yields a point estimate. Hence, without \emph{Bayesian} inference over the model parameters, it is impossible to estimate the epistemic uncertainty, which results in the model's inability to provide a robust estimate of the marginal likelihood for OoD inputs.


\subsection{Epistemic Uncertainty}

The required posterior estimation over model parameters $p(\boldsymbol{\theta}|\mathcal{D})$ in the case of discriminative DNNs is usually implemented in the following three ways:


\begin{enumerate}
\item By using variational posterior approximation~\citep{blundell2015weight};
\item By sampling from Markov Chain Monte Carlo~\citep{chen2014stochastic};
\item By capturing the local geometry of the posterior through fitting a Gaussian to the first two moments of the \textit{Stochastic Gradient Descent} (SGD) iterates~\citep{maddox2019simple}.
\end{enumerate}

The first method is a variational inference approach similar to the one used in VAEs, where the encoder is trained to infer the posterior over the latent variable. The difference is that it is now applied to minimize the KL-divergence between the intractable posterior \emph{over the model parameters} $p(\boldsymbol{\theta}|\mathcal{D})$ and a distribution $q(\boldsymbol{\theta}|\mathcal{D})$ from a family of tractable distributions. It is also implemented by maximizing the ELBO. Since we implement it in a VAE, we perform variational inference for both the posterior over the latent variable conditioned on the input and the posterior over the parameters conditioned on the training data. Such an approach is called fully \emph{Bayesian} in the case of VAEs~\citep{kingma2013auto}. There are various methods for approximating the posterior over the model parameters; we use the one suggested by~\citet{blundell2015weight}.

\textit{Markov Chain Monte Carlo} (MCMC) constructs a Markov chain with the desired posterior as its equilibrium distribution. The best-known MCMC method is the Metropolis-Hastings algorithm~\citep{hastings1970monte, doi:10.1063/1.1699114}. This algorithm converges to the true posterior using a random-walk proposal distribution. However, the convergence may be quite slow because a random walk explores the state space slowly. \textit{Hamiltonian Monte Carlo} (HMC)~\citep{duane1987hybrid} instead explores the state space by simulating Hamiltonian dynamics. This produces distant proposals for the Metropolis algorithm, resulting in much faster convergence. We use a variant of HMC adapted for deep learning, namely SGHMC, which relies on noisy gradient estimates~\citep{chen2014stochastic}.

Finally, it is possible to infer the desired posterior while training due to the noise in the SGD~\citep{mandt2017stochastic}. We apply the \textit{Stochastic Weight Averaging} (SWA)~\citep{izmailov2018averaging} together with fitting a Gaussian using the SWA solution as the first moment and covariance that is also derived from the SGD steps: the so-called SWAG method~\citep{maddox2019simple}. SWAG is easy to implement since it does not require additional sampling and can be used as a baseline for the rest of the methods.


\section{Methodology}

\subsection{Bayesian VAEs}

We implement several possible methods for \emph{Bayesian} VAEs (BVAEs). We apply the \emph{Bayesian} inference over the model parameters of the decoder of the VAE. Such a method allows sampling of several decoders to form an ensemble with the subsequent marginal log-likelihood estimation as indicated in Equation~\ref{eq_importance_sampling}.

\medskip \noindent \textbf{\textit{Bayes by Backpropagation.}}
%
We approximate the posterior distribution of the VAE decoder parameters given the training data, $p(\boldsymbol{\theta}|\mathcal{D})$, with the method suggested by~\citet{blundell2015weight}. This method was initially applied to discriminative learning; in our work, we implement it in VAEs. The objective is to find parameters $\boldsymbol{\lambda}$ of the variational distribution $q(\boldsymbol{\theta}|\boldsymbol{\lambda})$ that minimize the KL-divergence between this approximation and the true posterior, which is equivalent to maximizing the following ELBO:
%
\begin{footnotesize}
\begin{equation}
\mathcal{F}_{\boldsymbol{\theta}}(\mathcal{D}, \boldsymbol{\lambda}) = \mathbb{E}_{q(\boldsymbol{\theta}|\boldsymbol{\lambda})}\bigg[\log p(\mathcal{D}|\boldsymbol{\theta}) + \log p(\boldsymbol{\theta}) - \log q(\boldsymbol{\theta}|\boldsymbol{\lambda})\bigg]
\end{equation}
\end{footnotesize}
%
Here, $\log p(\mathcal{D}|\boldsymbol{\theta})$ is the sum of the marginal log-likelihoods of the individual inputs:
%
\begin{footnotesize}
\begin{equation}
\log p(\mathcal{D}|\boldsymbol{\theta}) = \sum_{i=1}^{N} \log p(\mathbf{x}^{(i)}|\boldsymbol{\theta})
\end{equation}
\end{footnotesize}
%
and
%
\begin{footnotesize}
\begin{equation}
\begin{split}
\log p(\mathbf{x}^{(i)}|\boldsymbol{\theta}) \geq \mathcal{L}_{\boldsymbol{\theta}, \boldsymbol{\phi}}(\mathbf{x}^{(i)})
\end{split}
\end{equation}
\end{footnotesize}
%
where $\mathcal{L}_{\boldsymbol{\theta}, \boldsymbol{\phi}}(\mathbf{x})$ is the ELBO on the marginal log-likelihood, with the latent variable marginalized out, as defined in Equation~\ref{eq:elbo_z}. Since this ELBO is a lower bound on the marginal log-likelihood, we use it as a tractable surrogate. The optimization objective is formulated as:
%
\begin{footnotesize}
\begin{equation}
\begin{split}
\widetilde{\mathcal{F}_{\boldsymbol{\theta}}}(\mathcal{D}, \boldsymbol{\lambda}) = \mathbb{E}_{q(\boldsymbol{\theta}|\boldsymbol{\lambda})}\bigg[&\sum_{i=1}^{N} \Big[ \mathcal{L}_{\boldsymbol{\theta},\boldsymbol{\phi}}(\mathbf{x}^{(i)})\Big] + \log p(\boldsymbol{\theta}) - \log q(\boldsymbol{\theta}|\boldsymbol{\lambda})\bigg]
\end{split}
\end{equation}
\end{footnotesize}

Now, if we plug in the right-hand side of the objective in Equation~\ref{eq:elbo_z}  we can get the Monte-Carlo estimation of the combined variational objective:
%
\begin{footnotesize}
\begin{equation}
\begin{split}
\widehat{\mathcal{F}_{\boldsymbol{\theta}, \boldsymbol{\phi}}}(\mathcal{D},\boldsymbol{\lambda}) \simeq \frac{1}{L} \sum_{j=1}^L \bigg[\sum_{i=1}^{N} \Big[ \log p_{\boldsymbol{\theta}^{(j)}}(\mathbf{x}^{(i)},\mathbf{z}) - \log q_{\boldsymbol{\phi}}(\mathbf{z}|\mathbf{x}^{(i)}) \Big] \bigg. \\ \bigg. + \log p(\boldsymbol{\theta}^{(j)}) - \log q(\boldsymbol{\theta}^{(j)}|\boldsymbol{\lambda}) \bigg]
\end{split}
\end{equation}
\end{footnotesize}
%
where $\boldsymbol{\theta}^{(j)}$ is sampled from the variational posterior $q(\boldsymbol{\theta} \mid \boldsymbol{\lambda})$, $\mathbf{z}$ is sampled from the variational posterior $q_{\boldsymbol{\phi}}(\mathbf{z} \mid \mathbf{x}^{(i)})$, and $N$ is taken equal to the batch size.


For the minimization objective, we use the negated estimate: $-\widehat{\mathcal{F}_{\boldsymbol{\theta},\boldsymbol{\phi}}}(\mathcal{D},\boldsymbol{\lambda})$. We assume a diagonal Gaussian distribution with parameters $\mu$ and $\sigma$ for both variational posteriors. To keep $\sigma$ strictly positive, we apply the same reparametrization as suggested by~\citet{blundell2015weight}, namely $\sigma = \log(1 + \exp(\rho))$, yielding the posterior parameters $\boldsymbol{\lambda} = (\mu, \rho)$. For the prior over the latent variable, we use the standard normal density; for the prior over the weights, we use the scale mixture of two Gaussians as in~\citet{blundell2015weight}.

The usual reparametrization trick~\citep{kingma2013auto} is applied to both $\boldsymbol{\theta}$ and $\mathbf{z}$ for training by backpropagation.
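
Concretely, and only as a sketch of this standard construction, both samples can be written as deterministic functions of parameter-free noise. The noise variables $\boldsymbol{\epsilon}_{\boldsymbol{\theta}}$, $\boldsymbol{\epsilon}_{\mathbf{z}}$ and the encoder outputs $\boldsymbol{\mu}_{\boldsymbol{\phi}}(\mathbf{x})$, $\boldsymbol{\sigma}_{\boldsymbol{\phi}}(\mathbf{x})$ are notation introduced here for illustration:
%
\begin{footnotesize}
\begin{equation}
\begin{split}
\boldsymbol{\theta} &= \mu + \log(1 + \exp(\rho)) \odot \boldsymbol{\epsilon}_{\boldsymbol{\theta}}, \quad \boldsymbol{\epsilon}_{\boldsymbol{\theta}} \sim \mathcal{N}(\mathbf{0}, \mathbf{I}) \\
\mathbf{z} &= \boldsymbol{\mu}_{\boldsymbol{\phi}}(\mathbf{x}) + \boldsymbol{\sigma}_{\boldsymbol{\phi}}(\mathbf{x}) \odot \boldsymbol{\epsilon}_{\mathbf{z}}, \quad \boldsymbol{\epsilon}_{\mathbf{z}} \sim \mathcal{N}(\mathbf{0}, \mathbf{I})
\end{split}
\end{equation}
\end{footnotesize}
%
Gradients with respect to $(\mu, \rho)$ and $\boldsymbol{\phi}$ then flow through these expressions, while the randomness is confined to the noise variables.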

\medskip \noindent \textbf{\textit{Stochastic Gradient Hamiltonian Monte Carlo.}}
%
This approach relies on sampling instead of optimization, which was the case with BBB. The sampling is done within the MCMC framework, with proposals generated by simulating Hamiltonian dynamics. Namely, assume that the posterior distribution has the form:
%
\begin{footnotesize}
\begin{equation}
p(\boldsymbol{\theta}|\mathcal{D}) \propto \exp(-U(\boldsymbol{\theta}, \mathcal{D}))
\end{equation}
\end{footnotesize}
%
where $U(\boldsymbol{\theta}, \mathcal{D})$ stands for the potential energy function in the Hamiltonian.

In our work, we take:
%
\begin{footnotesize}
\begin{equation}
U(\boldsymbol{\theta}, \mathcal{D}) = -\log{p(\boldsymbol{\theta},\mathcal{D})} = -\sum_{i=1}^{N} \log p(\mathbf{x}^{(i)}|\boldsymbol{\theta}) - \log p(\boldsymbol{\theta})
\end{equation}
\end{footnotesize}
%
where $\log p(\mathbf{x}^{(i)}|\boldsymbol{\theta})$ is approximated by the ELBO in our experiments and $\log p(\boldsymbol{\theta})$ is the log-prior over the parameters.

Since HMC requires computing the gradient over the whole dataset, \citet{chen2014stochastic} suggested a stochastic gradient alternative that relies on noisy gradients and generates proposals in a faster mini-batch manner.
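
As a sketch of what the noisy gradient refers to, the potential energy can be estimated on a mini-batch $\mathcal{B} \subset \mathcal{D}$ by rescaling the data term. The symbols $\mathcal{B}$ and $\widetilde{U}$ are notation introduced here for illustration, and $N$ denotes the size of the training set as in the definition of $U$:
%
\begin{footnotesize}
\begin{equation}
\widetilde{U}(\boldsymbol{\theta}, \mathcal{B}) = -\frac{N}{|\mathcal{B}|} \sum_{\mathbf{x} \in \mathcal{B}} \log p(\mathbf{x}|\boldsymbol{\theta}) - \log p(\boldsymbol{\theta})
\end{equation}
\end{footnotesize}
%
When $\mathcal{B}$ is drawn uniformly at random, $\mathbb{E}_{\mathcal{B}}[\widetilde{U}(\boldsymbol{\theta}, \mathcal{B})] = U(\boldsymbol{\theta}, \mathcal{D})$, so $\nabla_{\boldsymbol{\theta}} \widetilde{U}$ is an unbiased but noisy estimate of $\nabla_{\boldsymbol{\theta}} U$.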

In our work, we also apply the improvements suggested by~\citet{springenberg2016bayesian} which significantly reduce the number of hyperparameters through adaptive estimates of the parameters in question during the burn-in procedure and subsequent training. Such an approach has been previously implemented in the unsupervised generative setting with VAEs by~\citet{daxberger2019bayesian}.

\medskip \noindent \textbf{\textit{Stochastic Weight Averaging-Gaussian.}} SWAG fits the following Gaussian distribution:
%
\begin{footnotesize}
\begin{equation}
\mathcal{N}\left(\theta_{\mathrm{SWA}}, \widehat{\Sigma_{\mathrm{SWA}}}\right)
\end{equation}
\end{footnotesize}
%
\noindent where $\theta_{\mathrm{SWA}}$ is a running average over DNN parameters and $\Sigma_{\mathrm{SWA}}$ is the sample covariance matrix that after $T$ epochs can be calculated as:
%
\begin{scriptsize}
\begin{equation}
\theta_{\mathrm{SWA}}=\frac{1}{T} \sum_{i=1}^{T} \theta_{i}  \quad \textrm{and} \quad  \Sigma_{\mathrm{SWA}}=\frac{1}{T-1} \sum_{i=1}^{T}\left(\theta_{i}-\theta_{\mathrm{SWA}}\right)\left(\theta_{i}-\theta_{\mathrm{SWA}}\right)^{\top}
\end{equation}
\end{scriptsize}

Since $\Sigma_{\mathrm{SWA}}$ is of very high rank, it is approximated using the last $K$ epochs of training, resulting in $\widehat{\Sigma_{\mathrm{SWA}}}$~\citep{maddox2019simple}.
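
For reference, in the original SWAG formulation~\citep{maddox2019simple} this approximation combines a diagonal term with a low-rank term. It is restated here as a sketch, and implementation details may differ; $\Sigma_{\mathrm{diag}}$, $\widehat{D}$, and $\overline{\theta^2}$ follow the notation of that paper rather than ours:
%
\begin{footnotesize}
\begin{equation}
\begin{split}
\widehat{\Sigma_{\mathrm{SWA}}} &= \tfrac{1}{2}\Big(\Sigma_{\mathrm{diag}} + \tfrac{1}{K-1}\widehat{D}\widehat{D}^{\top}\Big), \quad \Sigma_{\mathrm{diag}} = \mathrm{diag}\big(\overline{\theta^2} - \theta_{\mathrm{SWA}}^2\big) \\
\widetilde{\theta} &= \theta_{\mathrm{SWA}} + \tfrac{1}{\sqrt{2}}\Sigma_{\mathrm{diag}}^{1/2} z_1 + \tfrac{1}{\sqrt{2(K-1)}}\widehat{D} z_2
\end{split}
\end{equation}
\end{footnotesize}
%
Here $\overline{\theta^2}$ is the running average of the element-wise squared parameters, the columns of $\widehat{D}$ are the deviations of the last $K$ iterates from the running mean, and $z_1 \sim \mathcal{N}(0, I_d)$, $z_2 \sim \mathcal{N}(0, I_K)$ with $d$ the number of parameters. Sampling $\widetilde{\theta}$ therefore never requires forming a $d \times d$ matrix.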

SWAG was previously applied only to discriminative DNNs; in our work, we implement it within a generative approach with VAEs.

\medskip \noindent \textbf{\textit{Combining several likelihoods.}}
%
Once the posterior over the weights has been approximated, the usual practice is to estimate the expected likelihood, whose exact form is:
%
\begin{footnotesize}
\begin{equation}
p(\mathbf{x^*}|\mathcal{D}) = \int p(\mathbf{x^*}|\theta)p(\theta|\mathcal{D})d\theta
\end{equation}
\end{footnotesize}

An unbiased estimate of this integral can be obtained as follows:
%
\begin{footnotesize}
\begin{equation}
\mathbb{E}_{p(\theta|\mathcal{D})}[p(\mathbf{x^*}|\theta)] \simeq \frac{1}{N} \sum_{i=1}^{N}p(\mathbf{x^*}|\theta_i), \quad \textrm{where} \quad \theta_i \sim p(\theta|\mathcal{D})
\end{equation}
\end{footnotesize}
%
Each $p(\mathbf{x^*}|\theta_i)$ is computed by importance sampling as in Equation~\ref{eq_importance_sampling}. Once the expected likelihood is estimated, one can apply a threshold to decide whether the considered input is in-distribution or not.

\citet{choi2018waic} use the likelihoods returned by several generative models to estimate the WAIC:
%
\begin{footnotesize}
\begin{equation}
\mathrm{WAIC}(\mathbf{x^*}|\mathcal{D}) = \mathbb{E}_{p(\theta|\mathcal{D})}[p(\mathbf{x^*}|\theta)] - \mathrm{Var}_{p(\theta|\mathcal{D})}[p(\mathbf{x^*}|\theta)]
\end{equation}
\end{footnotesize}

WAIC subtracts the variance of the obtained likelihoods from their expected value, so it favors inputs on which the ensemble members agree (small variance).

Another alternative is the disagreement score $D[\cdot]$ suggested by~\citet{daxberger2019bayesian}. This score measures the variation in the likelihoods $\{p(\mathbf{x^*}|\boldsymbol{\theta}_i)\}_{i=1}^{N}$, which captures the uncertainty of the models within the ensemble about the particular input:
%
\begin{footnotesize}
\begin{equation}
D_{\Theta}[\mathbf{x^*}] = \frac{1}{\sum_{\theta \in \Theta}w_{\theta}^2}, \quad \textrm{where} \quad w_{\theta} = \frac{p(\mathbf{x^*}|\boldsymbol{\theta})}{\sum_{\theta' \in \Theta}p(\mathbf{x^*}|\boldsymbol{\theta}')}
\end{equation}
\end{footnotesize}

The lower the score, the more informative the input is about the parameters $\boldsymbol{\theta}$, and consequently, the uncertainty value is higher.

The weights represent the normalized likelihoods between 0 and 1. The disagreement score sums up the squares of the weights and takes the reciprocal. If the score is large, it means that all models return close values of the likelihoods. On the contrary, if the score is 1, then there is one model that dominates.
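
As a small worked example with a hypothetical ensemble of $|\Theta| = 4$ models (the weight vectors below are invented purely for illustration), the score ranges between $1$ and $|\Theta|$:
%
\begin{footnotesize}
\begin{equation}
\begin{split}
w = (0.25, 0.25, 0.25, 0.25): \quad & D_{\Theta} = \frac{1}{4 \cdot 0.0625} = 4 \\
w = (0.7, 0.1, 0.1, 0.1): \quad & D_{\Theta} = \frac{1}{0.49 + 3 \cdot 0.01} \approx 1.92 \\
w = (1, 0, 0, 0): \quad & D_{\Theta} = 1
\end{split}
\end{equation}
\end{footnotesize}
%
The maximum $|\Theta|$ is attained only when all models assign the same likelihood to the input, and the minimum $1$ only when a single model carries all the weight.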

Finally, \citet{nalisnick2019detecting} conjectured that, due to the high dimensionality of the inputs, the over-confidence of DGMs may arise because in-distribution images lie in the typical set, whereas the tested OoD inputs concentrate in the high-density region. Based on this conjecture, they introduced a typicality test that treats an input sequence of length $M$ as in-distribution if its empirical entropy is sufficiently close to the entropy of the model, i.e., if the following holds for a small $\epsilon$:
%
\begin{footnotesize}
\begin{equation}
\begin{split}
\left|\frac{1}{M} \sum_{m=1}^{M}-\log p\left({\mathbf{x}}_{m} ; \boldsymbol{\theta}\right)-\mathbb{H}[p(\mathbf{x} ; \boldsymbol{\theta})]\right| \leq \epsilon
\end{split}
\end{equation}
\end{footnotesize}

We applied this score to one-element sequences since it is the most realistic scenario in practical applications of OoD detection.
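
For a single input $\mathbf{x^*}$ ($M = 1$), the test reduces to comparing its negative log-likelihood with the model entropy. The sketch below assumes, following~\citet{nalisnick2019detecting}, that the entropy is estimated by the average negative log-likelihood of the training data; the symbols $\widehat{\mathbb{H}}$ and $T$ are notation introduced here for illustration:
%
\begin{footnotesize}
\begin{equation}
\begin{split}
T[\mathbf{x^*}] &= \left| -\log p(\mathbf{x^*} ; \boldsymbol{\theta}) - \widehat{\mathbb{H}} \right|, \\
\widehat{\mathbb{H}} &= \frac{1}{|\mathcal{D}|} \sum_{\mathbf{x} \in \mathcal{D}} -\log p(\mathbf{x} ; \boldsymbol{\theta})
\end{split}
\end{equation}
\end{footnotesize}
%
A larger $T[\mathbf{x^*}]$ indicates a less typical input, and thresholding it at $\epsilon$ recovers the test above. Note that an input can be rejected for having a likelihood that is too high as well as too low.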

\medskip \noindent \textbf{\textit{Our scores.}}
The experiments with the available scores on all of the considered methods motivated us to use scores that directly capture the variation between the models of the ensemble. We apply two simple scores for this purpose. First, we measure the information entropy of the normalized likelihoods, namely:
%
\begin{footnotesize}
\begin{equation}
\mathbb{H}_{\Theta}[\mathbf{x^*}] = - \sum_{\theta \in \Theta}w_{\theta}\log{w_{\theta}} ; \quad \textrm{where} \quad w_{\theta} = \frac{p(\mathbf{x^*}|\boldsymbol{\theta})}{\sum_{\theta \in \Theta}p(\mathbf{x^*}|\boldsymbol{\theta})}
\end{equation}
\end{footnotesize}

It is a standard information-theoretic quantity that measures the average information of a distribution: the lower the entropy, the more a single model dominates the ensemble for the given input. The entropy is applied to the normalized likelihoods, which can be regarded as a categorical distribution over the models obtained from their marginal likelihoods.
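
As a hypothetical numerical illustration (values chosen for exposition only, using the natural logarithm), for an ensemble of $N=4$ models:
%
\begin{footnotesize}
\begin{align*}
w = (0.25, 0.25, 0.25, 0.25) &\;\Rightarrow\; \mathbb{H}_{\Theta}[\mathbf{x^*}] = \log 4 \approx 1.386, \\
w = (0.97, 0.01, 0.01, 0.01) &\;\Rightarrow\; \mathbb{H}_{\Theta}[\mathbf{x^*}] = -0.97 \log 0.97 - 3 \cdot 0.01 \log 0.01 \approx 0.030 + 0.138 = 0.168.
\end{align*}
\end{footnotesize}

Since the models usually return log-likelihoods $\ell_{\theta} = \log p(\mathbf{x^*}|\boldsymbol{\theta})$ rather than likelihoods, one way to evaluate the weights without numerical underflow is to shift by the largest log-likelihood before exponentiating. The symbols $\ell_{\theta}$ and $\ell_{\max}$ are introduced only for this sketch, and the shift is an implementation assumption rather than a detail reported above; it is algebraically identical to the definition of $w_{\theta}$:
%
\begin{footnotesize}
\begin{equation*}
w_{\theta} = \frac{\exp(\ell_{\theta} - \ell_{\max})}{\sum_{\theta' \in \Theta}\exp(\ell_{\theta'} - \ell_{\max})}; \quad \textrm{where} \quad \ell_{\max} = \max_{\theta' \in \Theta} \ell_{\theta'}
\end{equation*}
\end{footnotesize}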

Second, we calculate the sample standard deviation of the marginal log-likelihoods returned by the models within the ensemble:
%
\begin{footnotesize}
\begin{equation}
\Sigma_{\Theta}[\mathbf{x^*}] = \sqrt{\frac{1}{N-1} \sum_{{\theta \in \Theta}} (\log p(\mathbf{x^*}|\boldsymbol{\theta}) - \overline{\log p(\mathbf{x^*}|\boldsymbol{\theta})})^2}
\end{equation}
\end{footnotesize}

It measures the variation of the log-likelihoods directly, without the normalization step used for the entropy. If the variation persists across the considered methods and datasets, the standard deviation captures it: the higher the value, the more the models disagree about a particular input.
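
For illustration, suppose a hypothetical ensemble of $N=3$ models returns the log-likelihoods $-100$, $-102$, and $-104$ for an input $\mathbf{x^*}$ (numbers chosen for exposition only). The mean is $-102$, and the score evaluates to:
%
\begin{footnotesize}
\begin{equation*}
\Sigma_{\Theta}[\mathbf{x^*}] = \sqrt{\frac{1}{3-1}\left(2^2 + 0^2 + (-2)^2\right)} = \sqrt{4} = 2
\end{equation*}
\end{footnotesize}

Adding the same constant to all three log-likelihoods leaves the score unchanged, so it reflects only the spread between the models and not the overall level of the likelihood assigned to the input.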

\medskip \noindent \textbf{\textit{Thresholding.}}
There remains the open question of selecting an appropriate threshold for the model evaluation. Since we are working in the unsupervised setting, the ideal situation would intuitively be a threshold on the scores returned by the model that successfully divides the inputs into OoD and ID. To validate how well the scores achieve this task, we tackle the problem in the same way as~\citet{hendrycks2016baseline} by using three different metrics: the \textit{Area Under the Receiver Operating Characteristic Curve} (AUROC), the \textit{Area Under the Precision-Recall Curve} (AUPR), and the \textit{False-Positive Rate at 80\% of True-Positive Rate} (FPR80). These metrics do not require choosing a threshold: they are computed from the true-positive and false-positive rates (AUROC and AUPR over all possible thresholds) and summarize in a single value how well the raw scores separate the two classes.
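
To make the metrics concrete, the following sketch writes out the ROC-based quantities. It assumes that OoD inputs form the positive class and that a larger score $s(\mathbf{x})$ indicates an OoD input; both conventions, as well as the symbols $s$, $t$, and $t_{80}$, are assumptions introduced only for this illustration:
%
\begin{footnotesize}
\begin{align*}
\mathrm{TPR}(t) &= P\left(s(\mathbf{x}) > t \mid \mathbf{x} \textrm{ is OoD}\right), \qquad \mathrm{FPR}(t) = P\left(s(\mathbf{x}) > t \mid \mathbf{x} \textrm{ is ID}\right), \\
\mathrm{AUROC} &= \int_{0}^{1} \mathrm{TPR} \; d\,\mathrm{FPR} = P\left(s(\mathbf{x}_{\mathrm{OoD}}) > s(\mathbf{x}_{\mathrm{ID}})\right), \\
\mathrm{FPR80} &= \mathrm{FPR}(t_{80}); \quad \textrm{where} \quad \mathrm{TPR}(t_{80}) = 0.8.
\end{align*}
\end{footnotesize}

The second equality for AUROC holds for scores without ties. Under these conventions, an AUROC of 0.5 corresponds to random ranking, and a value below 0.5 means that the score tends to rank ID inputs above OoD inputs.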


\section{Evaluation}
\label{sec:evaluation}

\begin{table*}[t]
\sisetup{detect-weight,mode=text}
\centering
\caption{Scoring values across all types of \emph{Bayesian} VAEs trained on Fashion-MNIST data and tested on MNIST as OoD}
\label{table:FMNIST}
\resizebox{\textwidth}{!}{
\begin{tabular}{lSSSSSSSSS}
\toprule
& \multicolumn{9}{c}{\textit{\textbf{Fashion-MNIST vs. MNIST}}} \\
\cmidrule{2-10}
& \multicolumn{3}{c}{\textbf{BBB}} & \multicolumn{3}{c}{\textbf{SGHMC}} & \multicolumn{3}{c}{\textbf{SWAG}} \\
\cmidrule(lr){2-4}\cmidrule(lr){5-7}\cmidrule(lr){8-10}
& \textbf{ROC AUC$\uparrow$} & \textbf{AUPRC$\uparrow$} & \textbf{FPR80$\downarrow$} & \textbf{ROC AUC$\uparrow$} & \textbf{AUPRC$\uparrow$} & \textbf{FPR80$\downarrow$} & \textbf{ROC AUC$\uparrow$} & \textbf{AUPRC$\uparrow$} & \textbf{FPR80$\downarrow$} \\
\midrule
\textbf{Expected LL} & 40.43 & 45.46 & 95.20 & 40.43 & 45.18 & 94.99 & 25.09 & 38.01 & 99.54 \\
\textbf{WAIC} & 59.53 & 59.35 & 71.88 & 55.79 & 53.86 & 74.56 & 19.90 & 35.14 & 99.83 \\
\textbf{Typicality test} & 40.51 & 43.40 & 86.36 & 41.02 & 43.85 & 86.05 & 56.40 & 50.32 & 64.88 \\
\textbf{Disagreement score} & 96.44 & 97.22 & 1.11 & 95.25 & 96.31 & 2.50 & 79.98 & 80.74 & 38.24 \\
\textbf{Entropy (ours)} & 97.97 & 98.43 & \cellcolor[rgb]{0.9,0.9,0.9} \B 0.19 & 97.28 & 97.92 & \cellcolor[rgb]{0.9,0.9,0.9} \B 0.53 & \cellcolor[rgb]{0.9,0.9,0.9} \B 82.50 & \cellcolor[rgb]{0.9,0.9,0.9} \B 84.05 & \cellcolor[rgb]{0.9,0.9,0.9} \B 35.61 \\
\textbf{Stds of LLs (ours)} & \cellcolor[rgb]{0.9,0.9,0.9} \B 99.64 & \cellcolor[rgb]{0.9,0.9,0.9} \B 99.55 & 0.34 & \cellcolor[rgb]{0.9,0.9,0.9} \B 99.56 & \cellcolor[rgb]{0.9,0.9,0.9} \B 99.50 & 0.55 & 19.90 & 34.22 & 94.42  \\
\bottomrule
\end{tabular}
}
\end{table*}

\begin{table*}[t]
\sisetup{detect-weight,mode=text}
\centering
\caption{Scoring values across all types of \emph{Bayesian} VAEs trained on CIFAR-10 data and tested on SVHN as OoD}
\label{table:CIFAR10}
\resizebox{\textwidth}{!}{
\begin{tabular}{lSSSSSSSSS}
\toprule
& \multicolumn{9}{c}{\textit{\textbf{CIFAR-10 vs. SVHN}}} \\
\cmidrule{2-10}
& \multicolumn{3}{c}{\textbf{BBB}} & \multicolumn{3}{c}{\textbf{SGHMC}} & \multicolumn{3}{c}{\textbf{SWAG}} \\
\cmidrule(lr){2-4}\cmidrule(lr){5-7}\cmidrule(lr){8-10}
& \textbf{ROC AUC$\uparrow$} & \textbf{AUPRC$\uparrow$} & \textbf{FPR80$\downarrow$} & \textbf{ROC AUC$\uparrow$} & \textbf{AUPRC$\uparrow$} & \textbf{FPR80$\downarrow$} & \textbf{ROC AUC$\uparrow$} & \textbf{AUPRC$\uparrow$} & \textbf{FPR80$\downarrow$} \\
\midrule
\textbf{Expected LL} & 59.73 & 53.27 & 58.99 & 60.39 & 53.74 & 58.08 & 60.31 & 53.51 & 57.05 \\
\textbf{WAIC} & 61.15 & 54.22 & 57.15 & 62.39 & 55.38 & 55.07 & 64.29 & 55.81 & 50.59 \\
\textbf{Typicality test} & 63.73 & 60.89 & 65.53 & 64.44 & 61.05 & 64.01 & 64.93 & 61.52 & 64.33 \\
\textbf{Disagreement score} & 81.16 & 84.82 & 38.47 & 80.41 & 83.00 & 40.61 & 73.27 & 76.95 & 54.95 \\
\textbf{Entropy (ours)} & 84.76 & \cellcolor[rgb]{0.9,0.9,0.9} \B 88.21 & 29.31 & 84.56 & 86.90 & 29.12 & \cellcolor[rgb]{0.9,0.9,0.9} \B 76.51 & \cellcolor[rgb]{0.9,0.9,0.9} \B 80.54 & 49.37 \\
\textbf{Stds of LLs (ours)} & \cellcolor[rgb]{0.9,0.9,0.9} \B 89.98 & 85.83 & \cellcolor[rgb]{0.9,0.9,0.9} \B 16.03 & \cellcolor[rgb]{0.9,0.9,0.9} \B 92.52 & \cellcolor[rgb]{0.9,0.9,0.9} \B 91.48 & \cellcolor[rgb]{0.9,0.9,0.9} \B 12.27 & 71.26 & 64.65 & \cellcolor[rgb]{0.9,0.9,0.9} \B 44.34 \\
\bottomrule
\end{tabular}
}
\end{table*}

\begin{table*}[t]
\sisetup{detect-weight,mode=text}
\centering
\caption{Scoring values across all types of \emph{Bayesian} VAEs trained on MNIST data and tested on Fashion-MNIST as OoD}
\label{table:MNIST}
\resizebox{\textwidth}{!}{
\begin{tabular}{lSSSSSSSSS}
\toprule
& \multicolumn{9}{c}{\textit{\textbf{MNIST vs. Fashion-MNIST}}} \\
\cmidrule{2-10}
& \multicolumn{3}{c}{\textbf{BBB}} & \multicolumn{3}{c}{\textbf{SGHMC}} & \multicolumn{3}{c}{\textbf{SWAG}} \\
\cmidrule(lr){2-4}\cmidrule(lr){5-7}\cmidrule(lr){8-10}
& \textbf{ROC AUC$\uparrow$} & \textbf{AUPRC$\uparrow$} & \textbf{FPR80$\downarrow$} & \textbf{ROC AUC$\uparrow$} & \textbf{AUPRC$\uparrow$} & \textbf{FPR80$\downarrow$} & \textbf{ROC AUC$\uparrow$} & \textbf{AUPRC$\uparrow$} & \textbf{FPR80$\downarrow$} \\
\midrule
\textbf{Expected LL} & 99.98 & 99.98 & 0.00 & 99.93 & 99.92 & 0.04 &\cellcolor[rgb]{0.9,0.9,0.9} \B 96.83 & \cellcolor[rgb]{0.9,0.9,0.9} \B 96.20 & \cellcolor[rgb]{0.9,0.9,0.9} \B 5.18 \\
\textbf{WAIC} & 99.99 & 99.99 & 0.00 & \cellcolor[rgb]{0.9,0.9,0.9} \B 99.94 & \cellcolor[rgb]{0.9,0.9,0.9} \B 99.94 & 0.02 & 80.37 & 76.25 & 33.56 \\
\textbf{Typicality test} & 99.98 & 99.98 & 0.00 & 99.88 & 99.90 & 0.00 & 94.91 & 96.47 & 1.58 \\
\textbf{Disagreement score} & 98.95 & 99.01 & 0.23 & 97.32 & 97.70 & 1.37 & 94.88 & 93.97 & 8.99 \\
\textbf{Entropy (ours)} & 99.42 & 99.47 & 0.02 & 98.50 & 98.75 & 0.29 & 95.72 & 95.20 & 8.37\\
\textbf{Stds of LLs (ours)} & \cellcolor[rgb]{0.9,0.9,0.9} \B 99.99 & \cellcolor[rgb]{0.9,0.9,0.9} \B 99.99 & \cellcolor[rgb]{0.9,0.9,0.9} \B 0.00* & 99.91 & 99.91 & \cellcolor[rgb]{0.9,0.9,0.9} \B 0.00* & 80.37 & 82.78 & 39.12  \\
\bottomrule
\end{tabular}
}
{\scriptsize \textbf{*} Zeros are possible because the value is the false-positive rate at 80\% of true-positive rate.}
\end{table*}

\begin{table*}[t]
\sisetup{detect-weight,mode=text}
\centering
\caption{Scoring values across all types of \emph{Bayesian} VAEs trained on SVHN data and tested on CIFAR-10 as OoD}
\label{table:SVHN}
\resizebox{\textwidth}{!}{
\begin{tabular}{lSSSSSSSSS}
\toprule
& \multicolumn{9}{c}{\textit{\textbf{SVHN vs. CIFAR-10}}} \\
\cmidrule{2-10}
& \multicolumn{3}{c}{\textbf{BBB}} & \multicolumn{3}{c}{\textbf{SGHMC}} & \multicolumn{3}{c}{\textbf{SWAG}} \\
\cmidrule(lr){2-4}\cmidrule(lr){5-7}\cmidrule(lr){8-10}
& \textbf{ROC AUC$\uparrow$} & \textbf{AUPRC$\uparrow$} & \textbf{FPR80$\downarrow$} & \textbf{ROC AUC$\uparrow$} & \textbf{AUPRC$\uparrow$} & \textbf{FPR80$\downarrow$} & \textbf{ROC AUC$\uparrow$} & \textbf{AUPRC$\uparrow$} & \textbf{FPR80$\downarrow$} \\
\midrule
\textbf{Expected LL} & 58.65 & 61.79 & 77.72 & 57.09 & 60.56 & 80.18 & 58.98 & 62.06 & 76.52 \\
\textbf{WAIC} & 64.46 & 66.01 & 68.39 & 62.17 & 64.38 & 72.45 & 62.84 & 68.42 & 75.25 \\
\textbf{Typicality test} & 44.63 & 44.28 & 81.46 & 43.35 & 43.63 & 82.45 & 44.28 & 44.13 & 81.96 \\
\textbf{Disagreement score} & 85.20 & 88.35 & 30.26 & 85.31 & 88.52 & 28.66 & 77.58 & 80.36 & 45.60 \\
\textbf{Entropy (ours)} & 87.80 & 90.63 & 20.77 & 87.89 & 90.76 & 19.91 & \cellcolor[rgb]{0.9,0.9,0.9} \B 80.01 & \cellcolor[rgb]{0.9,0.9,0.9} \B 83.24 & \cellcolor[rgb]{0.9,0.9,0.9} \B 41.58 \\
\textbf{Stds of LLs (ours)} & \cellcolor[rgb]{0.9,0.9,0.9} \B 93.29 & \cellcolor[rgb]{0.9,0.9,0.9} \B 91.51 & \cellcolor[rgb]{0.9,0.9,0.9} \B 10.99 & \cellcolor[rgb]{0.9,0.9,0.9} \B 94.70 & \cellcolor[rgb]{0.9,0.9,0.9} \B 93.95 & \cellcolor[rgb]{0.9,0.9,0.9} \B 8.67 & 59.31 & 53.36 & 61.78 \\
\bottomrule
\end{tabular}
}
\end{table*}


\begin{figure*}[!ht]
\centering 
\subfigure{\includegraphics[width=0.5\columnwidth]{figures/BBB_MNIST_z10_EntropiesAuroc093Aupr095Fpr007_skipEpochs800TestOnBatches40_cropped.pdf}}
\hfill
\subfigure{\includegraphics[width=0.5\columnwidth]{figures/BBB_FMNIST_z10_EntropiesAuroc098Aupr098Fpr000_skipEpochs800TestOnBatches40_cropped.pdf}}
\hfill
\subfigure{\includegraphics[width=0.5\columnwidth]{figures/BBB_SVHN_z20_EntropiesAuroc088Aupr091Fpr021_skipEpochs800TestOnBatches40_cropped.pdf}}
\hfill
\subfigure{\includegraphics[width=0.5\columnwidth]{figures/BBB_CIFAR10_z70_EntropiesAuroc085Aupr088Fpr029_skipEpochs800TestOnBatches40_cropped.pdf}}
\hfill
\subfigure{\includegraphics[ width=0.47\columnwidth]{figures/SGHMC_MNIST_z10_EntropiesAuroc099Aupr099Fpr000_skipEpochs800TestOnBatches40_XAxisRescaled-crop.pdf}}
\hfill
\subfigure{\includegraphics[width=0.5\columnwidth]{figures/SGHMC_FMNIST_z10_EntropiesAuroc097Aupr098Fpr001_skipEpochs800TestOnBatches40_cropped.pdf}}
\hfill
\subfigure{\includegraphics[width=0.5\columnwidth]{figures/SGHMC_SVHN_z20_EntropiesAuroc088Aupr091Fpr020_skipEpochs800TestOnBatches40_cropped.pdf}}
\hfill
\subfigure{\includegraphics[width=0.5\columnwidth]{figures/SGHMC_CIFAR10_z70_EntropiesAuroc085Aupr087Fpr029_skipEpochs800TestOnBatches40_cropped.pdf}}
\caption{Histograms of the entropies of the normalized marginal likelihoods. The likelihoods are estimated by sampling from the \emph{Bayesian} VAEs; blue depicts in-distribution (ID) and orange out-of-distribution (OoD) inputs.
\textbf{From left to right:} MNIST as ID vs. Fashion-MNIST as OoD, Fashion-MNIST as ID vs. MNIST as OoD, SVHN as ID vs. CIFAR-10 as OoD, CIFAR-10 as ID vs. SVHN as OoD.
\textbf{Top:} Sampling is done from the Bayes-by-backprop VAE.
\textbf{Bottom:} Sampling is done from the SGHMC VAE.}
\label{fig:Entropies}
\end{figure*}

We run all of our experiments on four image datasets: MNIST~\citep{lecun-mnisthandwrittendigit-2010}, Fashion-MNIST~\citep{xiao2017/online}, SVHN~\citep{Netzer2011}, and CIFAR-10~\citep{CIFAR10}. As has been observed~\citep{ren2019likelihood, nalisnick2018deep}, likelihood-based OoD detection may depend strongly on the dataset the model has been trained on. For instance, if a model is trained on MNIST and OoD detection is performed on Fashion-MNIST, good results may be obtained merely by chance. To avoid such misleading conclusions, we train our models on all the datasets and check the following in-distribution vs. OoD pairs: MNIST vs. Fashion-MNIST and vice versa, and SVHN vs. CIFAR-10 and vice versa.

The following hardware was used in all of our experiments: an Intel Xeon Platinum 8160 CPU (2.1~GHz) with 32~GB of RAM and one NVIDIA Volta V100 GPU.

First, we estimate the impact of the number of latent dimensions on the loss function. The appropriate dimensionality is closely connected with the dataset the model is trained on. The MNIST and Fashion-MNIST results reveal no need to go beyond 10 latent dimensions, since the loss function did not decrease significantly after that value. For SVHN and CIFAR-10, we experimented with up to 100 latent dimensions; the best results were achieved with 20 dimensions for SVHN and 70 for CIFAR-10.

We used two different architectures in all our tests: one for the grayscale images and one for the RGB images, with 1 and 3 channels, respectively. All models were trained for 1000 epochs. To evaluate the inputs, we sampled 200 different models for our ensemble and evaluated them on a separate test split of 5120 images that the models had not been trained on. We used the test splits of both the ID and OoD datasets for all scores and metrics.

For our implementation of Bayes by backpropagation, we noticed that the random normal initializer of the DNN weights suggested as a prior in the original paper by~\citet{blundell2015weight} resulted in very slow convergence. To speed up the process, we also experimented with the following parameters: a random normal initializer with mean $0$ and standard deviation $0.1$ for $\mu$ and a constant initializer $\rho = -3$, which improved the training speed~\citep{krishnan2020bayesiantorch}.\footnote{For BBB, we used the PyTorch \emph{Bayesian} layers available at \url{https://github.com/IntelLabs/bayesian-torch}}

In the case of SGHMC, we adhere to the same protocol as~\citet{daxberger2019bayesian}; namely, we use the same scale-adapted sampler implementation with learning rate $10^{-3}$ for the training and momentum decay $0.05$.\footnote{The SGHMC sampler that we used is available at \url{https://github.com/automl/pybnn}} We also place Gaussian priors $p(\boldsymbol{\theta}) = \mathcal{N}(0, \lambda^{-1})$ with precision $\lambda$ over the decoder parameters, together with a Gamma hyperprior $p(\lambda) = \Gamma(\alpha, \beta)$ with $\alpha = \beta = 1$ over the precision; these are resampled on each epoch.

For the experiments with SWAG, we set $K=40$ and kept the default values of all remaining hyperparameters as in the original SWAG implementation.\footnote{The SWAG sampler that we used is available at~\url{https://github.com/wjmaddox/swa_gaussian}}

All the experiments are done within the framework implemented by~\citet{nielsen2020survae}. The results of the experiments against the benchmark scores for all datasets and models are reported in Tables~\ref{table:FMNIST}--\ref{table:SVHN}.


\section{Discussion}

As shown in the tables, \emph{Bayesian} methods used in the discriminative approach can be successfully transferred to generative models such as VAEs for OoD detection. Moreover, from the point of view of the scores, the variation among the model densities turns out to be persistent across all types of \emph{Bayesian} VAEs and all datasets. The best results are achieved for the BBB and SGHMC types of VAEs. The simple entropy score consistently demonstrates state-of-the-art results in detecting OoD inputs by capturing the variation better than the previously introduced baseline scores. The histograms of the entropies for both the BBB and SGHMC VAEs are shown in Figure~\ref{fig:Entropies}. In addition, the sample standard deviation outperforms the entropy score on most metrics in the case of the BBB and SGHMC methods.
 
Interestingly, almost all scores achieve comparably good results when the models are trained on the MNIST dataset and tested on Fashion-MNIST (Table~\ref{table:MNIST}), but many of the baseline scores demonstrate substantially worse values when the experiments are conducted the other way around, i.e., trained on Fashion-MNIST and tested on MNIST (Table~\ref{table:FMNIST}). The reason is that many of these scores are biased toward a particular type of data. The bi-directional experiments, however, easily identify these biases among all datasets and benchmark scores.

It can be observed that the worst-performing scores depend intrinsically either on the mean of the ensemble (such as WAIC) or on the log-likelihood returned by the model itself (such as the typicality test). In the first case, the variation of the likelihoods across different inputs dominates the variation between the models within the ensemble for a single input; e.g., the range of the variance of the estimated likelihoods for Fashion-MNIST is at least twice as large as for MNIST. In general, the more complex the training dataset, the greater the variance of the likelihoods that the model assigns to the inputs; hence, the variation between the models within the ensemble potentially has less influence (and is completely lost in the case of WAIC, for example). Our scores, on the other hand, measure the variance \emph{within} the ensemble, which allows them to catch even a slight difference in such variation. In the second case, the typicality test uses the log-likelihoods of the inputs directly without any ensembling, which makes it susceptible to the well-known issue with modern deep generative models discovered by~\citet{nalisnick2018deep}. The same applies to the expected log-likelihood metric.

Regarding speed, we consider the runtime required for the training to converge to the same values of the likelihoods as for the vanilla VAEs. In this respect, the overhead of SWAG is almost negligible compared with the vanilla VAE. BBB and SGHMC, on the contrary, both take much longer, with BBB requiring up to five times the training time of the vanilla VAE. SGHMC is faster than BBB but still lags far behind SWAG. There is also a clear tradeoff between the training speed and the resulting accuracy in distinguishing OoD from ID inputs: the fastest method in training (i.e., SWAG) yields the lowest OoD detection scores, whereas the much slower SGHMC yields the best OoD detection scores for the most complex dataset that we experimented with (i.e., CIFAR-10). BBB turns out to be the slowest in training, and its OoD results are slightly worse than those obtained by SGHMC. Since SWAG has only a minor overhead during training, it scales better to bigger datasets than SGHMC or BBB. If we also consider scalability to bigger models, all of the methods rely on sampling the weights, so there is no clear winner: the more parameters a particular DNN has, the slower the sampling is for all of the methods.

It should be emphasized that these discrepancies in speed between the suggested \emph{Bayesian} approaches may appear to be limitations in comparison with the vanilla VAEs and with the different ensemble techniques. First, let us consider the time required for the convergence of the model. As mentioned above, in the worst case it may take five times longer to obtain values comparable to those of the vanilla VAE for both components of the ELBO loss. However, this disadvantage disappears, and the proposed \emph{Bayesian} methods even become advantageous, when one trains an ensemble of separate models $\{p(\mathbf{x^*}|\boldsymbol{\theta}_i)\}_{i=1}^{N}$ with $N > 5$ under similar hardware constraints.
Second, consider the OoD detection itself, which is slowed down by the several estimations of the marginal likelihood within the ensemble and the subsequent calculation of the required score. This limitation primarily concerns the comparison with a single DGM that does not use ensembles. However, as we mentioned before, such a point estimate cannot be reliably used to estimate epistemic uncertainty. If we instead compare with the traditional ensembling techniques, the main difference from the suggested approach stems from \emph{how} the estimated marginal likelihoods are used. The former compute the expected likelihood, and the latter calculates the entropy or the sample standard deviation of the marginal log-likelihoods, which means that we get $\mathcal{O}(N)$ computations in either of these scenarios.
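
The training-cost argument can be sketched as a rough inequality. Here $T$ is a hypothetical symbol for the training time of a single vanilla VAE, and the factor 5 is the worst-case slowdown quoted above; the sketch ignores any parallel training of ensemble members, in line with the assumption of similar hardware constraints:
%
\begin{footnotesize}
\begin{equation*}
\underbrace{N \cdot T}_{\textrm{ensemble of } N \textrm{ separate VAEs}} \; > \; \underbrace{5 \cdot T}_{\textrm{worst-case \emph{Bayesian} VAE}} \quad \Longleftrightarrow \quad N > 5
\end{equation*}
\end{footnotesize}

For instance, with the $N = 200$ models used in the evaluation, separately trained ensemble members would cost about $200\,T$ under this sketch, compared with at most about $5\,T$ for a single \emph{Bayesian} VAE from which the 200 models are sampled.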


\section{Conclusion}

The ability of DGMs to detect OoD inputs is of significant importance for robust inference, especially in practical applications. Our work concentrates on a specific type of DGM: \emph{Bayesian} VAEs. We addressed this issue from three different perspectives:

\begin{enumerate}
\item \textbf{\textit{Method-wise.}} We implemented three methods for estimating epistemic uncertainty in the generative setting based on VAEs utilizing \emph{Bayesian} inference over model parameters: BBB based on variational inference, SGHMC based on Monte Carlo sampling, and SWAG based on the noise in the SGD. Most of these methods have previously been applied exclusively to discriminative models, and our paper bridges this gap between the two modeling approaches.

\item \textbf{\textit{Score-wise.}} We benchmarked all methods against the frequently used OoD scores: expected log-likelihood, WAIC, disagreement score, and typicality test. Moreover, during our experiments, we noticed that the most promising score was based on the idea of the variation of marginal likelihoods. Building on that, we proposed two simple scores for robust unsupervised OoD detection: one based on the information entropy and the other on the standard deviation. We achieved state-of-the-art results with them across all the benchmarked methods and considered datasets.

\item \textbf{\textit{Experiment-wise.}} We conducted thorough experiments with all methods and scores on several datasets. Moreover, to avoid potential errors, we evaluated the results bi-directionally, e.g., if we trained a model on the MNIST dataset and used Fashion-MNIST as OoD, then we also trained a model on the Fashion-MNIST dataset and checked its ability to detect MNIST inputs as OoD. Such a check is necessary to avoid the bias of any particular scoring method toward either more complicated or simpler data.
\end{enumerate}

The results of the experiments convincingly support the idea that estimating epistemic uncertainty based on the variation within the ensemble is beneficial for successful OoD detection in the case of VAEs. We observed that BBB and SGHMC demonstrated comparable performance. While SWAG was always worse than the other methods for the new OoD scores, we still conclude that it can be used as a simple baseline for epistemic uncertainty in the case of VAEs in the same manner as in the discriminative approach. Moreover, from the point of view of the training convergence, SWAG turned out to be the fastest among all the considered \emph{Bayesian} methods. Future work may revolve around a deeper understanding of the sources of variation within the ensemble from the point of view of the latent space of the VAE, e.g., whether there is a correlation between ``holes'' in the latent manifold and a greater variance of the likelihoods.

\begin{acknowledgements}
This project has received funding from the European Union's Horizon 2020 research and innovation programme under grant agreement No 883275 (HEIR).
\end{acknowledgements}

\bibliography{glazunov_583.bib}


\appendix
% NOTE: necessary when ptmx or no mathfont class option is given
\providecommand{\upGamma}{\Gamma}
\providecommand{\uppi}{\pi}


\end{document}
