\RequirePackage{algorithm}
\RequirePackage{algorithmic}

\documentclass[accepted]{uai2023} % for initial submission
% \documentclass[accepted]{uai2023} % after acceptance, for a revised
                                    % version; also before submission to
                                    % see how the non-anonymous paper
                                    % would look like
%% There is a class option to choose the math font
% \documentclass[mathfont=ptmx]{uai2023} % ptmx math instead of Computer
                                         % Modern (has noticable issues)
% \documentclass[mathfont=newtx]{uai2023} % newtx fonts (improves upon
                                          % ptmx; less tested, no support)
% NOTE: Only keep *one* line above as appropriate, as it will be replaced
%       automatically for papers to be published. Do not make any other
%       change above this note for an accepted version.

%% Choose your variant of English; be consistent
\usepackage[american]{babel}
% \usepackage[british]{babel}

%% Some suggested packages, as needed:
\usepackage{natbib} % has a nice set of citation styles and commands
    \bibliographystyle{plainnat}
    \renewcommand{\bibsection}{\subsubsection*{References}}
\usepackage{mathtools} % amsmath with fixes and additions
% \usepackage{siunitx} % for proper typesetting of numbers and units
\usepackage{booktabs} % commands to create good-looking tables
\usepackage{tikz} % nice language for creating drawings and diagrams

%% Provided macros
% \smaller: Because the class footnote size is essentially LaTeX's \small,
%           redefining \footnotesize, we provide the original \footnotesize
%           using this macro.
%           (Use only sparingly, e.g., in drawings, as it is quite small.)

%% Self-defined macros
\newcommand{\swap}[3][-]{#3#1#2} % just an example


%Our imports
\usepackage{microtype}
\usepackage{graphicx}
\usepackage{subfigure}
\usepackage{booktabs} % for professional tables
\usepackage{svg}

\usepackage{hyperref}


% Attempt to make hyperref and algorithmic work together better:
\newcommand{\theHalgorithm}{\arabic{algorithm}}
% \usepackage{algcompatible}

% \usepackage{algorithm}
% \usepackage{algpseudocode}

% % Attempt to make hyperref and algorithmic work together better:
% \usepackage{algorithm}
% \usepackage{algorithmic}
% % \usepackage{algorithm2e}

% % \newcommand{\theHalgorithm}{\arabic{algorithm}}
% \usepackage{algcompatible}
% \usepackage{algpseudocode}

% For theorems and such
\usepackage{amsmath}
\usepackage{amssymb}
\usepackage{mathtools}
\usepackage{amsthm}
\usepackage[normalem]{ulem}

\usepackage{xcolor}
\newcommand{\cblue}{\textcolor{blue}}
\newcommand{\cred}{\textcolor{red}}
\newcommand{\cgreen}{\textcolor{green}}
\newcommand{\cmag}{\textcolor{magenta}}

% Definitions
\DeclareMathOperator*{\argmax}{arg\,max}
\DeclareMathOperator*{\argmin}{arg\,min}


% if you use cleveref..
\usepackage[capitalize,noabbrev]{cleveref}

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% THEOREMS
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\theoremstyle{plain}
\newtheorem{theorem}{Theorem}[section]
\newtheorem{proposition}[theorem]{Proposition}
\newtheorem{lemma}[theorem]{Lemma}
\newtheorem{corollary}[theorem]{Corollary}
\theoremstyle{definition}
\newtheorem{definition}[theorem]{Definition}
\newtheorem{assumption}[theorem]{Assumption}
\theoremstyle{remark}
\newtheorem{remark}[theorem]{Remark}

% Todonotes is useful during development; simply uncomment the next line
%    and comment out the line below the next line to turn off comments
%\usepackage[disable,textsize=tiny]{todonotes}
% \usepackage[textsize=tiny]{todonotes}


%Eli
\newcommand{\EF}[1]{{\color{green}{EF: #1}}}
\newcommand{\nic}[1]{{\color{blue}{ND: #1}}}


\title{Deep Gaussian Mixture Ensembles}

% The standard author block has changed for UAI 2023 to provide
% more space for long author lists and allow for complex affiliations
%
% All author information is automatically removed by the class for the
% anonymous submission version of your paper, so you can already add your
% information below.
%
% Add authors
% \author[1]{\href{mailto:<jj@example.edu>?Subject=Your UAI 2023 paper}{Jane~J.~von~O'L\'opez}{}}
\author[1]{Yousef El-Laham}
\author[1]{Niccol\`o Dalmasso}
\author[1]{Elizabeth Fons}
\author[1]{Svitlana Vyetrenko}
% Add affiliations after the authors
\affil[1]{%
    J.P. Morgan AI Research, New York, USA
}
% \affil[2]{%
%     Second Affiliation\\
%     Address\\
%     …
% }
% \affil[3]{%
%     Another Affiliation\\
%     Address\\
%     …
%   }
  
\begin{document}
\maketitle


\begin{abstract}
This work introduces a novel probabilistic deep learning technique called deep Gaussian mixture ensembles (DGMEs), which enables accurate quantification of both epistemic and aleatoric uncertainty. By assuming the data generating process follows that of a Gaussian mixture, DGMEs are capable of approximating complex probability distributions, such as heavy-tailed or multimodal distributions. Our contributions include the derivation of an expectation-maximization (EM) algorithm for learning the model parameters, which yields a training log-likelihood that upper-bounds that of standard deep ensembles. Additionally, the proposed EM training procedure allows for learning the mixture weights, which is not commonly done in ensembles. Our experimental results demonstrate that DGMEs outperform state-of-the-art uncertainty-quantifying deep learning models in handling complex predictive densities.
\end{abstract}

\section{Introduction}
\label{introduction}
Uncertainty quantification plays a key role in the development and deployment of machine learning systems, especially in applications where user safety and risk assessment are of paramount importance \citep{abdar2021review}. While deep learning (DL) has cemented its superiority in terms of raw predictive performance for a variety of applications, the principled incorporation of uncertainty quantification in DL models remains an open challenge. Since standard DL models are unable to properly quantify predictive uncertainty, one common challenge for DL models is detecting out-of-distribution (OOD) inputs. It is often the case that OOD inputs lead a DL model to make erroneous predictions \citep{ovadia2019can}. Without uncertainty quantification, one cannot reason about whether an input is OOD, and this can be catastrophic in applications such as machine-assisted medical decision making \citep{begoli2019need} or self-driving vehicles \citep{michelmore2018evaluating}. Moreover, uncertainty quantification can also be used as a means to select samples to label in active learning and to enable exploration in reinforcement learning algorithms \citep{clements2019estimating, charpentier2022disentangling}. 


Uncertainty in machine learning models is derived from two different sources: aleatoric uncertainty and epistemic uncertainty \citep{kendall2017uncertainties, hullermeier2019aleatoric, valdenegro2022deeper}. Aleatoric uncertainty derives from the measurement process of the data, while epistemic uncertainty derives from the uncertainty in the parameters of the machine learning model. A variety of approaches have been proposed to quantify both types of uncertainty in DL models from both a Bayesian and a frequentist perspective; we refer the reader to \citet{gawlikowski2021survey} for a comprehensive review. Under the Bayesian paradigm, the goal is to infer the posterior predictive density of the target variable given the input features and the training data, which encodes both types of uncertainty. Unfortunately, exact Bayesian inference algorithms (e.g., \citealt{neal2012bayesian}) cannot scale to the parameter space of modern DL architectures, and one often has to resort to mini-batching \citep{chen2014stochastic} or to forming a rough parametric approximation of the posterior distribution of the parameters, such as the Laplace approximation \citep{daxberger2021laplace} or stochastic variational inference \citep{graves2011practical, hoffman2013stochastic}. The drawback of the parametric approach is the inability to express more complex (e.g., heavy-tailed or multimodal) predictive distributions. As an example, approximations such as mean-field variational inference form a Gaussian predictive distribution that tends to underestimate the true uncertainty of more complex models.

In recent years, there have been developments in probabilistic DL which exploit the inherent stochasticity in learning to quantify predictive uncertainty. Examples include techniques such as probabilistic backpropagation \citep{hernandez2015probabilistic}, Monte Carlo dropout (MCD, \citealt{gal2016dropout}), Monte Carlo batch normalization \citep{teye2018bayesian}, and deep ensembles (DEs, \citealt{lakshminarayanan2017de}), among others. MCD and DEs have emerged as state-of-the-art solutions for quantifying uncertainty in DL models due to their simplicity and effectiveness. MCD utilizes the inherent stochasticity of dropout (i.e., random masking of neural network weights) to form an ensemble-based approximation of the predictive distribution through multiple stochastic forward passes of the model to account for epistemic uncertainty. Aleatoric uncertainty is handled as a post-processing step under the assumption that the underlying data noise is homoscedastic. DEs, on the other hand, independently train a small ensemble of dual-output neural networks, where the outputs characterize the mean and variance of the predictive distribution. Each network in the ensemble is independently trained to maximize the likelihood of the data under the heteroscedastic Gaussian assumption. At test time, the networks are linearly combined into a single Gaussian approximation of the predictive distribution. Unfortunately, neither MCD nor DEs are adequate solutions for modeling more complex data distributions (e.g., heavy-tailed or multimodal distributions). 


{\bf Contributions.} Our contributions are as follows:

\begin{enumerate}
    \item We propose a novel probabilistic DL technique called {\it deep Gaussian mixture ensembles} (DGMEs) for jointly quantifying epistemic and aleatoric uncertainty. DGMEs train a weighted deep ensemble using the expectation maximization algorithm;
     \item We show that DGMEs optimize the joint data likelihood directly, unlike deep ensembling, which targets a lower bound of the data likelihood. As a consequence, DGMEs achieve a superior loss to deep ensembling, which is corroborated by our experimental results;
    \item  We empirically show that our model is more expressive than standard probabilistic DL approaches and can capture both heavy-tailedness and multimodality.
\end{enumerate}



\section{Related Work} 


\textbf{Mixture Density Networks.}
Mixture density networks (MDNs, \citealt{bishop1994mixture}) use a deep neural network to simultaneously learn the means, variances and mixture weights of a Gaussian mixture model.
MDNs have been successfully used in many machine learning applications, such as computer vision~\citep{bazzani2016recurrent}, speech synthesis~\citep{zen2014deep}, probabilistic forecasting~\citep{zhang2020improved}, astronomy~\citep{d2018photometric}, chemistry~\citep{unni2020deep} and epidemiology~\citep{davis2020use}, among others. While MDNs are closely related to DGMEs in terms of uncertainty quantification, they are not an ensemble technique per se, as the epistemic and aleatoric uncertainty cannot be disentangled in MDNs. Moreover, without the ensemble structure, MDNs cannot easily be trained in a distributed setting, whereas the training of DGMEs can trivially be parallelized.

\textbf{Monte Carlo Dropout.}
MCD exploits the stochasticity of dropout training to quantify epistemic uncertainty in DL models. At test time, stochastic forward passes through a DL model with dropout produce ``approximate'' samples from the underlying posterior predictive distribution, which are typically summarized using first- and second-order moments (e.g., mean and variance of the samples). Aleatoric uncertainty is accounted for in a post-processing step, whereby the optimal homoscedastic variance that maximizes the evidence lower bound is obtained via cross-validation. MCD's popularity can be attributed to its simple implementation, as no changes to the standard DL training procedure are required. While vanilla MCD can yield favorable results in additive Gaussian settings, the method is less effective when dealing with more complex data generating processes (e.g., heavy-tailed or multimodal predictive densities). In this work, we incorporate MCD in our training procedure to account for epistemic uncertainty (see Section~\ref{sec:epistemic-uncertainty}).


\textbf{Deep Ensembles.}
DEs quantify both aleatoric and epistemic uncertainty by building an ensemble of independently trained models under different neural network weight initializations. Combined with adversarial training \citep{goodfellow2014explaining}, DEs achieve competitive or better performance than MCD in most settings in terms of calibration of predictive uncertainty and in terms of reasoning about OOD inputs. It has been argued in several works that DEs can be interpreted as a Bayesian approach, where the learned weights of each ensemble member correspond to a sample from the posterior distribution of the network weights \citep{wilson2020bayesian}. In recent years, new variations of deep ensembles have been proposed, such as anchored DEs \citep{pearce2020uncertainty}, deep-split ensembles \citep{sarawgi2020have}, and hybrid training approaches that combine DEs with the Laplace approximation \citep{hoffmann2021deep}. We emphasize that a key distinction between DEs and DGMEs is that each sample in DEs is treated as an i.i.d. sample from a Gaussian distribution, whereas DGMEs assume that data are distributed according to a Gaussian mixture. This gives DGMEs the key advantage of being able to learn more complex data generating processes. 


\textbf{Neural Expectation Maximization.}
Neural EM is a differentiable clustering technique that combines the principles of the EM algorithm with neural networks for representation learning, particularly in the field of computer vision for perceptual grouping tasks \citep{NIPS2017_NEM}. The goal of neural EM is to group the individual entities (i.e., pixels) of a given input (i.e., image) that belong to the same object. To do this, a finite mixture model is used to construct a latent representation of each image, where each mixture component represents a distinct object. A neural network is then used to transform the parameters of that mixture model into pixel-wise distributions over the image, allowing for reasoning about which object each pixel in the image belongs to. While neural EM combines the ideas of EM with deep learning, we emphasize that this is different from our work which focuses on the accurate quantification of predictive uncertainty in the supervised learning setting.




\section{Problem Formulation}
\label{problem_formulation}
Consider a set of training data $\mathcal{D}=\{(x_n, y_n)\}_{n=1}^N$, where $x_n\in\mathbb{R}^{d_x}$ is the feature vector and $y_n$ is the output, which can be real-valued if we are dealing with a regression task or integer-valued if we are dealing with a classification task. We would like to train a model that allows us to predict an output $y$ given its corresponding input vector $x$. 

From a probabilistic perspective, the goal is to determine the posterior predictive distribution $p(y|x, \mathcal{D})$. We assume a statistical model $p_{\theta}(y|x)\triangleq p(y|x, \theta)$ that relates each output to its corresponding feature vector through a set of parameters $\theta\in\Theta$. Then, the predictive distribution can be determined as
\begin{equation}
    \label{eq: predictive_distribution}
    p(y|x, {\cal D})=\int_{\Theta} p_{\theta}(y|x)p(\theta|\mathcal{D})d\theta. 
\end{equation}
While this integral is generally intractable, it can be approximated using a Monte Carlo average, where samples are taken from the posterior $p(\theta|{\cal D})$. Let $Y=\{y_1,\ldots,y_N\}$ and let $X=\{x_1,\ldots, x_N\}$. According to Bayes' theorem, the posterior distribution $p(\theta|{\cal D})$ is
\begin{equation}
    \label{eq: bayes_theorem_post}
    p(\theta|\mathcal{D}) = \frac{p(Y|X,\theta)p(\theta)}{p(Y|X)}, 
\end{equation}
where $p(Y|X,\theta)=\prod_{n=1}^N p_{\theta}(y_n|x_n)$ is called the {\it data likelihood} under the i.i.d. assumption, $p(\theta)$ is the {\it prior distribution} of $\theta$, and $p(Y|X)=\int_{\Theta}p(Y|X,\theta)p(\theta)d\theta$ is called the {\it marginal likelihood}. The posterior can only be computed analytically when $p(\theta)$ is a conjugate prior for the likelihood function $p(Y|X, \theta)$. For deep learning models, an analytical solution to the posterior cannot be determined and one must resort to an approximation of the predictive distribution.
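
For concreteness, if $S$ draws $\theta^{(1)},\ldots,\theta^{(S)}$ from the posterior $p(\theta|{\cal D})$ were available (the number of draws $S$ is notation introduced here only for illustration), the Monte Carlo average of the predictive distribution in \eqref{eq: predictive_distribution} would take the form
\begin{equation*}
    p(y|x, {\cal D}) \approx \frac{1}{S}\sum_{s=1}^S p_{\theta^{(s)}}(y|x).
\end{equation*}
In practice such draws are rarely available exactly for deep learning models, which is why approximate schemes are needed.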

\paragraph{Goal.} The goal is to acquire an approximation of the posterior predictive distribution. Ideally, samples from the approximation form consistent estimators of key moments of the predictive distribution that allow one to (i) formulate predictions, (ii) identify the underlying stochastic risk associated with the prediction (i.e., aleatoric uncertainty), and (iii) reason about the model's uncertainty in the presence of OOD data (i.e., epistemic uncertainty).


\section{Deep Gaussian Mixture Ensembles}
\label{proposed_method}
In this work, we propose DGMEs to effectively learn a mixture distribution that accurately represents the true conditional density of the labels given the features. Since Gaussian mixtures are universal approximators for smooth probability density functions \citep{bacharoglou2010approximation, goodfellow2016deep}, modeling the conditional density $p_{\theta}(y|x)$ as a Gaussian mixture allows for learning more complex distributions, such as skewed, heavy-tailed, and multimodal distributions. Under the assumption that our data follows a mixture distribution with $K$ mixture components, the conditional density of a particular example $(x, y)$ is given by: 
\begin{equation}
    \label{eq: mixture_of_gaussains_assumptions}
    p_{\theta}(y|x) = \sum_{k=1}^K \pi_k p_k(y|x, \theta_k),
\end{equation}
where $\theta_k\in\Theta_k\subseteq \mathbb{R}^{d_{\theta}}$ denotes the parameters of the $k$-th mixture component and $\pi_k$ denotes its weight, i.e., the probability that the example $(x, y)$ is distributed according to $p_k(y|x, \theta_k)$. Throughout the rest of the text, we refer to all unknown parameters in the mixture as $\theta=\{\pi_1, \theta_1, \ldots, \pi_K, \theta_K\}$. Hereafter, we consider the problem of learning the parameters of the mixture in \eqref{eq: mixture_of_gaussains_assumptions} in the context of regression. We discuss a possible extension to classification in the Supplementary Material, Section C.

To effectively model this mixture, we make the following assumptions:
\begin{assumption}
\label{assum: equally_weighted mixture}
The mixture weights $(\pi_1, \ldots, \pi_K)\in{\cal S}_K$ do not depend on the input features, where ${\cal S}_K$ denotes the $K$-dimensional probability simplex.
\end{assumption}
\begin{assumption}
\label{assum: gaussian_mixture_assumption}
The conditional density $p_k(y|x, \theta_k)$ is a Gaussian distribution whose parameters are modeled via parameterized functions (neural networks) dependent on $x$:
\begin{equation}
\label{eq: conditional_density_assumption}
p_k(y|x, \theta_k) = {\cal N}(y; \mu_{\theta_k}(x), \sigma^2_{\theta_k}(x)),
\end{equation}
where $\theta_k$ denote the parameters of functions $\mu_{\theta_k}(\cdot)$ and $\sigma^2_{\theta_k}(\cdot)$ that output the mean and variance of the $k$-th mixture, respectively. Importantly, these functions are assumed to share parameters, just as in the original work on DEs \citep{lakshminarayanan2017de}.
\end{assumption}
Under the above assumptions, learning the mixture representation of $p_{\theta}(y|x)$ is equivalent to learning the parameters $\theta$ to maximize the data likelihood of the training examples ${\cal D}=\{(x_n, y_n)\}_{n=1}^N$.

\subsection{Learning the Mixture Parameters}
We obtain the maximum likelihood (ML) estimate or maximum a posteriori (MAP) estimate of the unknown parameters $\theta$ using the EM algorithm. Let $Y=\{y_1, \ldots, y_N\}$ and $X=\{x_1, \ldots, x_N\}$. Furthermore, let $Z=\{z_1, \ldots, z_N\}$, where each $z_n\in\{1,\ldots, K\}$ is a latent variable that denotes membership assignment of the training example $(x_n, y_n)$ to a particular mixture component, where $\pi_k\triangleq P_{\theta}(z_n=k)$ is the probability that the example $(x_n, y_n)$ belongs to the $k$-th component. Assuming that the training examples are independent and identically distributed, we can write the joint likelihood as
\begin{align*}
&p_{\theta}(Y, Z|X) = \\
&\prod_{n=1}^N\prod_{k=1}^K \pi_k^{\mathbb{I}(z_n=k)}\mathcal{N}(y_n; \mu_{\theta_k}(x_n), \sigma_{\theta_k}^2(x_n))^{\mathbb{I}(z_n=k)},
\end{align*}
with corresponding log-likelihood of
\begin{align*}
&\log p_{\theta}(Y, Z|X) =  \sum_{n=1}^N\sum_{k=1}^K \mathbb{I}(z_n=k)(\log \pi_k + \ell_{\theta_k}(x_n, y_n)),
\end{align*}
where $\ell_{\theta_k}(x, y)=\log\left(\mathcal{N}(y; \mu_{\theta_k}(x), \sigma_{\theta_k}^2(x))\right)$. Our goal is to solve the following optimization problem:
\begin{align}
    \label{eq: exact_data_likelihood_max}
    \theta^\star &= \argmax_{\theta} \log p_{\theta}(Y|X) \\
    &= \argmax_{\theta} \log \left(\sum_{Z} p_{\theta}(Y, Z|X)\right),
\end{align}
which we numerically solve using the EM algorithm. In the following, we describe both the expectation step (E-Step) and maximization step (M-Step) as they relate to our model. As a note, all results presented hereafter also apply to the more general problem of obtaining the MAP estimate of the parameters $\theta$.\footnote{That is, the maximizer of $\log p(Y, \theta|X)  = \log p_{\theta}(Y|X) + \log p(\theta)$, where $p(\theta)$ is the prior distribution of the mixture parameters.}

\paragraph{E-Step:} 
We update the posterior probabilities of each $z_n$ given the parameters $\theta$ and the example $(x_n, y_n)$ for each $n$, denoted by $\gamma_{k, n}\triangleq P_{\theta}(z_n=k|x_n, y_n)$. This can be done directly using Bayes' theorem:
\begin{align}
    \gamma_{k, n} &= \frac{p_{k}(y_n|x_n, \theta_k)P_{\theta}(z_n=k)}{\sum_{j=1}^K p_{j}(y_n|x_n, \theta_j)P_{\theta}(z_n=j)} \\
    &= \frac{\pi_k{\cal N}(y_n; \mu_{\theta_k}(x_n), \sigma_{\theta_k}^2(x_n))}{\sum_{j=1}^K \pi_j {\cal N}(y_n; \mu_{\theta_j}(x_n), \sigma_{\theta_j}^2(x_n))}. \label{eq: posterior_updates}
\end{align}
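
As a small numerical illustration of \eqref{eq: posterior_updates} (the numbers are chosen here purely for exposition and are not taken from any experiment), take $K=2$ with $\pi_1=\pi_2=1/2$, and suppose that at some input $x_n$ the two components predict $\mu_{\theta_1}(x_n)=0$, $\mu_{\theta_2}(x_n)=3$, and $\sigma^2_{\theta_1}(x_n)=\sigma^2_{\theta_2}(x_n)=1$. For an observed output $y_n=1$, the common factors cancel and
\begin{equation*}
    \gamma_{1, n} = \frac{e^{-1/2}}{e^{-1/2} + e^{-2}} = \frac{1}{1 + e^{-3/2}} \approx 0.82, \qquad \gamma_{2, n} = 1 - \gamma_{1, n} \approx 0.18,
\end{equation*}
so the example is attributed mostly to the first component. In an implementation, it may be preferable to evaluate the same quantity in the log domain to reduce the risk of numerical underflow:
\begin{equation*}
    \log \gamma_{k, n} = \log \pi_k + \ell_{\theta_k}(x_n, y_n) - \log \sum_{j=1}^K \exp\left(\log \pi_j + \ell_{\theta_j}(x_n, y_n)\right).
\end{equation*}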

\paragraph{M-Step:} The parameters $\theta$ are updated in the maximization step by maximizing the expected joint log-likelihood $Q(\theta, \theta')\triangleq \mathbb{E}_{Z|X, Y, \theta'} \left[\log p_{\theta}(Y, Z|X)\right]$ given the previous parameter values $\theta'$, which is equivalent to doing lower-bound maximization on the true log-likelihood \citep{minka1998expectation}. The function $Q(\theta, \theta')$ can be readily determined as:
\begin{equation} 
Q(\theta, \theta') = \sum_{n=1}^N\sum_{k=1}^K \gamma_{k, n}(\log(\pi_k) + \ell_{\theta_k}(x_n, y_n)).
\end{equation}
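This form follows by taking the expectation of the complete-data log-likelihood term by term: the only random quantities are the indicators, and
\begin{equation*}
    \mathbb{E}_{Z|X, Y, \theta'}\left[\mathbb{I}(z_n=k)\right] = P_{\theta'}(z_n=k|x_n, y_n) = \gamma_{k, n},
\end{equation*}
where the posterior probabilities $\gamma_{k, n}$ are evaluated at the previous parameter values $\theta'$.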
The optimization of the mixture weights $(\pi_1, \ldots, \pi_K)$ can be carried out analytically and done independently of optimizing the mixture parameters $\{\theta_1, \ldots, \theta_K\}$:
\begin{equation}
    (\pi_1^\star,\ldots, \pi_K^\star) = \argmax_{(\pi_1,\ldots,\pi_K)\in{\cal S}_K} Q(\theta, \theta'),
\end{equation}
where for each  $k$
\begin{equation}
    \pi_k^\star = \frac{1}{N}\sum_{n=1}^N \gamma_{k, n}.
\end{equation}
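As a short derivation of this update (with $\lambda$ a Lagrange multiplier introduced here only for this purpose), note that only the terms $\gamma_{k, n}\log(\pi_k)$ of $Q(\theta, \theta')$ depend on the mixture weights, so one can form the Lagrangian
\begin{equation*}
    \mathcal{L}(\pi_1, \ldots, \pi_K, \lambda) = \sum_{n=1}^N\sum_{k=1}^K \gamma_{k, n}\log(\pi_k) + \lambda\left(1 - \sum_{k=1}^K \pi_k\right).
\end{equation*}
Setting $\partial\mathcal{L}/\partial\pi_k = 0$ gives $\pi_k = \frac{1}{\lambda}\sum_{n=1}^N \gamma_{k, n}$, and enforcing $\sum_{k=1}^K \pi_k = 1$ together with $\sum_{k=1}^K \gamma_{k, n} = 1$ for every $n$ yields $\lambda = N$, which recovers the expression above.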
Since the mixture component parameters are parameterized by neural networks, their optimization must be carried out using stochastic optimization. Because $Q(\theta, \theta')$ decomposes as a sum over components, the optimization of each $\theta_k$ can be done independently as:
\begin{align} 
     &\hspace*{-0.1cm}\theta_k^\star = \argmax_{\theta_k\in\Theta_k} \sum_{n=1}^N \gamma_{k, n}\ell_{\theta_k}(x_n, y_n) \\
    &\hspace*{-0.1cm}= \argmin_{\theta_k\in\Theta_k} \sum_{n=1}^N \gamma_{k, n} \left(\log \sigma_{\theta_k}^2(x_n) + \frac{(y_n - \mu_{\theta_k}(x_n))^2}{\sigma_{\theta_k}^2(x_n)} \right). \label{eq: weighted_nll}
\end{align}
This optimization step can be thought of as training a deep ensemble, where each sample $(x_n, y_n)$ is weighted by $\gamma_{k, n}$ in its negative log-likelihood contribution. 
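
For completeness, the second equality in the update of $\theta_k$ above can be checked by writing out the Gaussian log-density,
\begin{equation*}
    \ell_{\theta_k}(x, y) = -\frac{1}{2}\log(2\pi) - \frac{1}{2}\log \sigma_{\theta_k}^2(x) - \frac{(y - \mu_{\theta_k}(x))^2}{2\sigma_{\theta_k}^2(x)},
\end{equation*}
where $\pi$ inside the logarithm denotes the mathematical constant rather than a mixture weight. The first term does not depend on $\theta_k$, and multiplying the remaining terms by $-2$ turns the maximization into a minimization without changing the optimizer, which gives \eqref{eq: weighted_nll}.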

\subsection{Implementation Details}

\begin{algorithm}[t]
   \caption{Deep Gaussian Mixture Ensembles (DGMEs)}
   \label{alg: dgme}
    \begin{algorithmic}[1]
   \STATE {\bf Inputs:} %\vspace{-0.25cm} 
   \begin{itemize}
       \item Training dataset ${\cal D}=\{(x_n, y_n)\}_{n=1}^N$ %\vspace{-0.25cm}
       \item Number of mixture components $K$ %\vspace{-0.25cm}
       \item Number of EM steps $J$
   \end{itemize}
   \STATE {\bf Initialize mixture parameters:} %\vspace{-0.25cm} 
   \begin{itemize}
       \item Sample $\theta_k^{(0)} \sim p(\theta)$ for all $k$. %\vspace{-0.25cm}
       \item Set $\pi_k^{(0)} = \frac{1}{K}$ for all $k$. %\vspace{-0.25cm}
   \end{itemize}
    \FOR{$j=1,\ldots, J$ } 
   \STATE {\bf E-Step:} Update posterior probabilities $\gamma_{k, n}^{(j)}$ according to \eqref{eq: posterior_updates} with mixture weights $\pi_{k}^{(j-1)}$ and mixture parameters $\theta_k^{(j-1)}$ for all $k$ and $n$.
   \STATE {\bf M-Step:} Update mixture weights $\pi_k^{(j)}$ and parameters $\theta_k^{(j)}$ for all $k$ as
   \begin{equation*}
        \pi_k^{(j)} = \frac{1}{N}\sum_{n=1}^N \gamma_{k, n}^{(j)}
   \end{equation*} 
   and
    \begin{equation*}
    \theta_k^{(j)} = \argmax_{\theta_k\in\Theta_k} \sum_{n=1}^N \gamma_{k, n}^{(j)} \ell_{\theta_k}(x_n, y_n)  
    \end{equation*} 
    \ENDFOR
    \STATE {\bf Return:} $\pi_k^\star = \pi_k^{(J)}$  and $\theta_k^\star = \theta_k^{(J)}$ for all $k$.  
    \end{algorithmic}
\end{algorithm}

Our implementation of DGMEs trained via the EM algorithm is summarized in Algorithm~\ref{alg: dgme}. To initialize the ensemble, the parameters of each network are randomly initialized, while the mixture weights are set to be equal. The algorithm is run for $J$ steps or, alternatively, until some stopping criterion is met. The E-Step for updating the posterior probabilities is computed directly for each sample in the training set. In the M-Step, the updates for the mixture weights are also carried out analytically, but for the mixture component parameters $\theta_k$ we use stochastic optimization to numerically solve for the updates, as an analytical solution is not available. At round $j$, we initialize each network to $\theta_k^{(j-1)}$ and then run the Adam optimizer for $E$ epochs to minimize the weighted negative log-likelihood in \eqref{eq: weighted_nll}, where the weights are given by $\gamma_{k, n}^{(j)}$ for all $n$. Finally, we note that the computational complexity of each EM step is equivalent to that of training DEs and the overall time complexity scales linearly with the number of EM steps. 


\subsection{Quantifying Epistemic Uncertainty} \label{sec:epistemic-uncertainty}
It is important to highlight that up until this point, we have not explicitly considered epistemic uncertainty in DGMEs. This is because training DGMEs according to Algorithm~\ref{alg: dgme} yields a single set of parameters of the assumed Gaussian mixture model.\footnote{This point highlights the intrinsic difference between training DEs and training DGMEs. DGMEs do not have a ``Bayesian'' interpretation, because the EM algorithm used to train them only outputs a single set of parameters (i.e., the corresponding posterior distribution of the weights is a Dirac measure centered at the learned parameter values).} To account for model uncertainty, we need to account for the uncertainty in the parameters of the mixture (i.e., the mixture weights and/or the weights of the ensemble neural networks). One simple way to do this is to apply MCD to the training procedure of DGMEs, although we emphasize that other techniques can be applied to account for epistemic uncertainty (e.g., the Laplace approximation or a variational approximation to the posterior distribution of the parameters).

Let $a_k = [a_{k, 1},\ldots, a_{k, d_{\theta}}]^\intercal \in \{0, 1\}^{d_\theta}$ denote a random binary vector of the same size as each $\theta_k$ and let $p_d\in[0, 1]$ denote a fixed dropout probability. Also, let $\theta^\star=\{\pi_1^\star, \theta_1^\star, \ldots,\pi_K^\star, \theta_K^\star\}$ denote the parameters learned by running Algorithm \ref{alg: dgme} with dropout incorporated in the training in the M-Step. For a given mixture component $k$, samples from the approximate posterior distribution of $\theta_k$ learned via dropout can be obtained via the following procedure: 
\begin{align*}
    a_{k, i} &\sim {\rm Bernoulli}(1 - p_d), \quad i=1,\ldots, d_{\theta}, \\
    \theta_k &= a_k \odot \theta_k^\star,
\end{align*}
where $\odot$ denotes the Hadamard (element-wise) product, so that each entry of $\theta_k^\star$ is zeroed out with probability $p_d$. It follows that a sample from the predictive distribution can be obtained directly as follows:
\begin{align}
    k &\sim {\rm Categorical}(\pi_1^\star,\ldots, \pi_K^\star),  \label{eq: sample_categorical}\\
    a_{k, i} &\sim {\rm Bernoulli}(1 - p_d), \quad i=1,\ldots, d_{\theta}, \label{eq: sample_mask}\\
    \theta_k &= a_k \odot \theta_k^\star, \label{eq: sample_param}\\
    y &\sim  p_k(y|x, \theta_k). \label{eq: sample_pred}
\end{align}
In this procedure, one first samples the mixture component $k$ via \eqref{eq: sample_categorical}. Then, one draws a sample from the approximate posterior distribution of the parameters of the $k$-th mixture component via \eqref{eq: sample_mask}--\eqref{eq: sample_param}. Finally, a prediction can be sampled via \eqref{eq: sample_pred}. We refer the reader to the Supplementary Material, Section E for details on the validity of this sampling procedure. 
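
As a complement to the sampling procedure, if the dropout masks are ignored (i.e., conditioning on the learned parameters $\theta^\star$), the first two moments of the Gaussian mixture are available in closed form by the laws of total expectation and total variance. The symbols $\hat{\mu}(x)$ and $\hat{\sigma}^2(x)$ are notation introduced here only for illustration:
\begin{align*}
    \hat{\mu}(x) &= \sum_{k=1}^K \pi_k^\star\, \mu_{\theta_k^\star}(x), \\
    \hat{\sigma}^2(x) &= \sum_{k=1}^K \pi_k^\star \left(\sigma^2_{\theta_k^\star}(x) + \mu^2_{\theta_k^\star}(x)\right) - \hat{\mu}(x)^2.
\end{align*}
When dropout is used at test time, the corresponding quantities can instead be estimated by the sample mean and sample variance of repeated draws of $y$ from the procedure above.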


\subsection{Theoretical Insights} \label{sec:theory}

In this section, we provide insights into the connection between DGMEs and DEs, along with general results on the convergence of the training procedure for DGMEs. We refer the reader to the Supplementary Material, Section A for the proof of each proposition.%~\ref{sec:app-proofs}

Proposition~\ref{prop:max-lower-bound} shows that maximizing the data likelihood directly as in DGMEs achieves an equal or better likelihood than maximizing each ensemble member's likelihood separately as in DEs.

\begin{proposition} \label{prop:max-lower-bound}
Under the assumption that $\pi_i = 1/K$ for $i=1,\ldots,K$,
%$(\pi_1,\ldots,\pi_K)=(1/K, \ldots, 1/K)$,
maximizing the Gaussian mixture data likelihood directly achieves better or equal joint likelihood than maximizing each ensemble member's likelihood separately.
\end{proposition}

\textit{Proof Sketch.} The result can be obtained by using Jensen's inequality on the joint log-likelihood of equation~\eqref{eq: exact_data_likelihood_max} along with the assumption.
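
In slightly more detail, the key step can be sketched as follows (this is only an outline under the equal-weight assumption; the full argument is in the Supplementary Material). With $\pi_k = 1/K$, concavity of the logarithm and Jensen's inequality give, for each training example,
\begin{equation*}
    \log\left(\frac{1}{K}\sum_{k=1}^K \mathcal{N}(y_n; \mu_{\theta_k}(x_n), \sigma_{\theta_k}^2(x_n))\right) \geq \frac{1}{K}\sum_{k=1}^K \ell_{\theta_k}(x_n, y_n).
\end{equation*}
Summing over $n$ shows that the mixture log-likelihood $\log p_{\theta}(Y|X)$ is bounded below by $\frac{1}{K}\sum_{k=1}^K\sum_{n=1}^N \ell_{\theta_k}(x_n, y_n)$, which is, up to the factor $1/K$, the sum of the objectives maximized when each ensemble member is trained separately.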

Next, Proposition~\ref{prop:em-convergence} combines recent results on neural network convergence in regression by \citet{arora2019fine} and \citet{farrell2021deep} with classical EM analysis \citep{wu1983convergence} to give intuition on why DGMEs should converge towards the maximum of the data likelihood.\footnote{We note that Proposition~\ref{prop:em-convergence} covers a specific setup, in which mean and variance function estimation is performed separately using a shared pre-trained feature extraction layer, and in which the true data generating process is identifiable with a mixture model to begin with. A more thorough investigation of using separate neural networks for the mean and variance, as well as of the under- or over-specified case, is outside the scope of this paper.}

\begin{assumption}[Non-flatness of the weighted log-likelihood] \label{assumption-min-likelihood}
    Given a DGME with $K$ mixture components, in each EM round $t$ there exists an $\epsilon_{t,k}$ such that:

    \begin{equation*}
        \sum_{n=1}^N \gamma_{k,n}\left( \ell_{\theta^{*}}(x_n, y_n) -  \ell_{\theta^{(t)}}(x_n, y_n) \right) \geq \frac{\epsilon_{t,k}}{K},
    \end{equation*}
    where $\theta^{*}_k = \argmax_{\theta \in \Theta} \sum_{n=1}^N \gamma_{k,n} \ell_{\theta}(x_n, y_n)$. Let $\epsilon = \min_{t \in T, k \in K} \epsilon_{t,k}$.
\end{assumption}

\begin{assumption}[Smoothness of the true mean function] \label{assumption-true-function-mean}
Let $\mu(x): \mathcal{X} \to \mathbb{R}$ be the true mean function and let $X \subset \mathcal{X}$. Assume there exists some $\beta \in \mathbb{N}^+$ such that $\mu(x) \in \mathcal{W}^{\beta, \infty}(X)$, where $\mathcal{W}^{\beta, \infty}(X)$ is a $(\beta, \infty)$-Sobolev ball.
\end{assumption}

\begin{assumption}[Smoothness of the true variance function] \label{assumption-true-function-variance}
Let $\sigma(x): \mathcal{X} \to \mathbb{R}^{+}$ be the true variance function and let $X \subset \mathcal{X}$. Let $\mathbf{H}^{\infty}$ be the Gram matrix as defined by \citet[Equation 12]{arora2019fine}, and assume that there exists an $M \in \mathbb{R}$ such that $\sigma(x)^T \left(\mathbf{H}^\infty\right) \sigma(x) \leq M$.
\end{assumption}

\begin{assumption}[Non-degenerate weights] 
\label{assumption-non-degenerate-weights}
In each EM iteration, the weights are positive and bounded away from zero, i.e., $\pi^{(t)}_i > \xi^{(t)}_i > 0$.
\end{assumption}

\begin{proposition} \label{prop:em-convergence}
Under Assumptions \ref{assumption-min-likelihood}, \ref{assumption-true-function-mean}, \ref{assumption-true-function-variance}, and \ref{assumption-non-degenerate-weights}, let the mean and variance in each ensemble model be estimated via a separate 2-layer deep ReLU network from a common feature extraction layer. Then the DGMEs EM algorithm converges to a stationary point that maximizes the data likelihood with high probability.
\end{proposition}

\textit{Proof Sketch.} The result follows if one shows that $Q(\theta; \theta^{(j)})$ is an increasing function of the EM steps $j$ \citep{wu1983convergence}, for parameter values $\theta^{(j)}$ that are not stationary points of $Q(\theta; \theta^{(j)})$. In the DGMEs case, this corresponds to proving that the weighted log-likelihood of each ensemble member increases at every round $j$. This is obtained by combining the non-flatness of the weighted log-likelihood (Assumption~\ref{assumption-min-likelihood}), the smoothness of the true mean function (Assumption~\ref{assumption-true-function-mean}), and the smoothness of the true variance function (Assumption~\ref{assumption-true-function-variance}) with the convergence results for deep ReLU networks of \citet{farrell2021deep} and \citet{arora2019fine}, respectively.
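
For intuition, the classical monotonicity argument can be sketched as follows. Here $\mathcal{D}$ denotes the data, $z$ the latent mixture assignments, $\mathcal{L}(\theta) = \log p(\mathcal{D} \mid \theta)$ the data log-likelihood, and $H(\theta; \theta^{(j)}) = \mathbb{E}_{z \sim p(z \mid \mathcal{D}, \theta^{(j)})}\left[\log p(z \mid \mathcal{D}, \theta)\right]$; these symbols are shorthand assumed here and are not taken from the main text. Since $\mathcal{L}(\theta) = Q(\theta; \theta^{(j)}) - H(\theta; \theta^{(j)})$, we have
\begin{equation*}
\begin{aligned}
    \mathcal{L}(\theta^{(j+1)}) - \mathcal{L}(\theta^{(j)}) ={}& \left[ Q(\theta^{(j+1)}; \theta^{(j)}) - Q(\theta^{(j)}; \theta^{(j)}) \right] \\
    &+ \left[ H(\theta^{(j)}; \theta^{(j)}) - H(\theta^{(j+1)}; \theta^{(j)}) \right].
\end{aligned}
\end{equation*}
The second bracket equals $\mathrm{KL}\left( p(z \mid \mathcal{D}, \theta^{(j)}) \,\|\, p(z \mid \mathcal{D}, \theta^{(j+1)}) \right) \geq 0$, so any M-step that does not decrease $Q$ cannot decrease the data log-likelihood.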

Finally, Proposition~\ref{prop:conn-under-initialization} connects DGMEs and DEs, showing that DEs are equivalent to a single EM step of DGMEs under a specific initialization of the neural network weights. As shown in Proposition~\ref{prop:em-convergence}, the EM training of DGMEs improves the function $Q$ at each iteration $t$, i.e., $Q\left(\theta^{(t+1)}, \theta^{(t)}\right) \geq Q\left(\theta^{(t)}, \theta^{(t)}\right)$. Hence, the final joint DGME likelihood will be greater than or equal to the joint likelihood achieved by DEs.

\begin{proposition} \label{prop:conn-under-initialization}
If the weights of each ensemble member are initialized to 0 with fixed bias terms, a single EM step for DGMEs is equivalent to training DEs.
\end{proposition}

\textit{Proof Sketch.} The initialization scheme implies that mixture membership is equal across samples in the first expectation round of EM. Hence, the first M-step consists of training $K$ separate networks, with each log-likelihood contribution weighted equally.
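
To make this concrete under a stated assumption, suppose the fixed biases are shared across members, so that at initialization every member outputs the same mean and variance $(\mu_0, \sigma_0^2)$ for any input; the symbols $\mu_0$ and $\sigma_0^2$ are introduced here for illustration. The first E-step responsibilities are then
\begin{equation*}
    \gamma_{k,n} = \frac{\pi_k \, \mathcal{N}(y_n ; \mu_0, \sigma_0^2)}{\sum_{j=1}^K \pi_j \, \mathcal{N}(y_n ; \mu_0, \sigma_0^2)} = \pi_k \quad \text{for all } n,
\end{equation*}
which do not depend on the sample index. Each member's M-step objective $\sum_{n=1}^N \gamma_{k,n} \ell_{\theta}(x_n, y_n)$ is therefore a positive multiple of its ordinary log-likelihood, and has the same maximizer as the objective optimized by each DE member.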


\section{Experiments}
\label{experiments}
We evaluate the empirical performance of DGMEs via three numerical experiments. We compare our method to MDNs \citep{bishop1994mixture}, MCD \citep{gal2016dropout}, and DEs \citep{lakshminarayanan2017de}. MCD and DEs are widely considered to be state-of-the-art solutions for quantifying predictive uncertainty in deep learning models and have repeatedly been used as baselines for developing new techniques. Additional results and figures can be found in the Supplementary Material, Section B. A summary table qualitatively comparing DGMEs to the benchmarks can be found in the Supplementary Material, Section D.

\subsection{Toy Regression}
\label{experiments:toy}
Consider the following model:
\begin{equation}
    y_n = u_nx_n^3 + \epsilon_n,
\end{equation}
where $u_n\in\{-1, 1\}$ with $p_u\triangleq P(u_n=-1)$ and $\epsilon_n\sim p(\epsilon)$ for all $n=1,\ldots,N$. We generate $N=800$ training samples from this model, where the input values $x_n$ range from $-4$ to $4$. For each considered setting, we use a learning rate of $\eta=0.01$, a batch size of 32, and $E=80$ epochs to solve the stochastic optimization problem in the M-step. For each method, we use a dropout probability of $p_d=0.1$ to account for epistemic uncertainty. Additionally, we generate data from this toy model under three different noise settings to demonstrate the flexibility and expressive power of DGMEs compared to the baselines. Unless otherwise stated, we assume $K=5$ networks in each mixture model-based approach (i.e., MDNs, DEs, and DGMEs). Experimental results are described below for each noise scenario. Additional experimental results and ablation studies are provided in the Supplementary Material, Section B.1.


\begin{figure*}[!ht]
    \centering
    \includegraphics[width=\linewidth, trim={0 0cm 0cm 0}, clip]{./figures/sota_comparison_heavy_tailed.eps}
    %\vspace*{-0.75cm}
    \caption{Histograms of samples from the predictive distributions for a single training example (top panel) and for a single test example (bottom panel) from the heavy-tailed toy regression example, shown with the corresponding sample kurtosis value $\kappa$. DGMEs generally estimate heavier-tailed predictions for both training and test samples, while samples from the baseline approaches are closer to following a Gaussian distribution.}
    \label{fig: sota_comparison_heavy_tailed}
\end{figure*}

\begin{figure*}[!ht]
    \centering
    \includegraphics[width=0.9\linewidth, trim={0 17.6cm 0 0}, clip]{./figures/sota_comparison_bimodal.eps}
    \caption{Predictive distribution plots for the bimodal Gaussian toy regression example. DEs cannot capture the multimodality of the noise, while MDNs and DGMEs can. Furthermore, DGMEs approximate the mixture weights of the noise accurately (ground truth: $\pi_1=0.7$ and $\pi_2=0.3$).}
    \label{fig: sota_comparison_bimodal}
\end{figure*}


\begin{table*}[t]
    \centering
    \resizebox{\textwidth}{!}{
\begin{tabular}{lccccccc}
\toprule
\multicolumn{8}{c}{\textbf{\textsc{Test RMSE}}}\\
\midrule
Dataset        &          MDNs &           MCD &           DEs &   DGMEs ($J=1$) &   DGMEs ($J=2$) &   DGMEs ($J=5$) &  DGMEs ($J=10$) \\
\midrule
Boston housing &  \bf 2.79 $\pm$ 0.84 &  2.97 $\pm$ 0.85  &  3.28 $\pm$ 1.00 &  3.11 $\pm$ 0.94 &  3.00 $\pm$ 0.90 &  2.87 $\pm$ 0.86 &    2.83 $\pm$ 0.91 \\
Concrete       &  5.21 $\pm$ 0.56 &   5.23 $\pm$ 0.53 &  6.03 $\pm$ 0.58 &  5.67 $\pm$ 0.57 &  5.36 $\pm$ 0.51 &  5.20 $\pm$ 0.59 &  \bf 5.14 $\pm$ 0.58 \\
Energy         & \bf 0.71 $\pm$ 0.14 &   1.66 $\pm$ 0.19 &  2.09 $\pm$ 0.29 &  2.01 $\pm$ 0.29 &  1.79 $\pm$ 0.24 &  1.22 $\pm$ 0.25 &  1.07 $\pm$ 0.41 \\
Kin8nm         &  0.08 $\pm$ 0.00 &   0.10 $\pm$ 0.00 &  0.09 $\pm$ 0.00 &  0.08 $\pm$ 0.00 &  0.08 $\pm$ 0.00 &  \bf 0.07 $\pm$ 0.00 &  \bf 0.07 $\pm$ 0.00  \\
Power plant    &  4.12 $\pm$ 0.17 &   \bf 4.02 $\pm$ 0.18 &  4.11 $\pm$ 0.17 &  4.12 $\pm$ 0.16 &  4.10 $\pm$ 0.15 &  4.07 $\pm$ 0.15 &     4.05 $\pm$ 0.13 \\
% Protein        &          NaN &   4.36 $\pm$ 0.04 &  4.71 $\pm$ 0.06 &          NaN &          NaN &          NaN &          NaN \\
Wine           &  0.66 $\pm$ 0.04 &  \bf 0.62 $\pm$ 0.04 &  0.64 $\pm$ 0.04 &  0.63 $\pm$ 0.04 &  0.64 $\pm$ 0.04 &  0.64 $\pm$ 0.04 &      0.66 $\pm$ 0.05 \\
Yacht          &  0.96 $\pm$ 0.36 &   1.11 $\pm$ 0.38 &  1.58 $\pm$ 0.48 &  0.98 $\pm$ 0.38 &  0.85 $\pm$ 0.36 &  0.83 $\pm$ 0.40 &     \bf 0.70 $\pm$ 0.26 \\
\bottomrule
\end{tabular}
}
\caption{Average RMSE of test examples for regression experiments on real datasets. DGMEs obtain competitive or better performance in terms of RMSE on the majority of datasets as compared to the baselines.}\label{tab:regression_experiments_RMSE}
\end{table*}



\begin{table*}[t]
    \centering
    \resizebox{\textwidth}{!}{
\begin{tabular}{lccccccc}
\toprule
\multicolumn{8}{c}{\textbf{\textsc{Test NLL}}}\\
\midrule
Dataset             &           MDNs &           MCD &            DEs &    DGMEs ($J=1$) &    DGMEs ($J=2$) &    DGMEs ($J=5$) &  DGMEs ($J=10$) \\
\midrule
Boston housing      &   2.62 $\pm$ 0.43 &  2.46 $\pm$ 0.25  &   2.41 $\pm$ 0.25 &   2.34 $\pm$ 0.19 &  \bf 2.33 $\pm$ 0.22 &   2.41 $\pm$ 0.25 &     2.46 $\pm$ 0.31 \\
Concrete            &   3.11 $\pm$ 0.26 &   3.04 $\pm$ 0.09 &   3.06 $\pm$ 0.18 &   3.04 $\pm$ 0.11 &   3.00 $\pm$ 0.12 &   2.95 $\pm$ 0.13 & \bf 2.94 $\pm$ 0.14 \\
Energy              &  \bf 1.18 $\pm$ 0.30 &   1.99 $\pm$ 0.09 &   1.38 $\pm$ 0.22 &   1.71 $\pm$ 0.19 &   1.48 $\pm$ 0.15 &   1.20 $\pm$ 0.23 &  1.20 $\pm$ 0.40 \\
Kin8nm              &  -1.18 $\pm$ 0.04 &  -0.95 $\pm$ 0.03 &  -1.20 $\pm$ 0.02 &  -1.20 $\pm$ 0.02 &  -1.23 $\pm$ 0.03 & -1.24 $\pm$ 0.02 &  \bf   -1.25 $\pm$ 0.02 \\
Power plant         &   2.81 $\pm$ 0.04 &   2.80 $\pm$ 0.05 & \bf 2.79 $\pm$ 0.04 &   2.82 $\pm$ 0.03 &   2.81 $\pm$ 0.03 &   2.81 $\pm$ 0.03 &           \bf 2.79 $\pm$ 0.03 \\
% Protein             &           NaN &   2.89 $\pm$ 0.01 &   2.83 $\pm$ 0.02 &           NaN &           NaN &           NaN &          NaN \\
Wine                &   1.01 $\pm$ 0.10 &  \bf 0.93 $\pm$ 0.06 &   0.94 $\pm$ 0.12 &   0.95 $\pm$ 0.11 &   0.96 $\pm$ 0.11 &   0.96 $\pm$ 0.12 &    1.10 $\pm$ 0.09 \\
Yacht               &   1.18 $\pm$ 0.17 &   1.55 $\pm$ 0.12 &   1.18 $\pm$ 0.21 &   1.07 $\pm$ 0.22 &   0.75 $\pm$ 0.22 &  0.60 $\pm$ 0.29 &  \bf   0.49 $\pm$ 0.29 \\
\bottomrule
\end{tabular}
}
\caption{Average NLL of test examples for regression experiments on real datasets. DGMEs obtain competitive or better performance in terms of NLL on the majority of datasets as compared to the baselines.}\label{tab:regression_experiments_NLL}
\end{table*}

\paragraph{Case 1 - Gaussian Noise:}
 We set $p_u=0$ and assume that the noise is zero-mean and Gaussian distributed with a variance of $9$. This is analogous to the setup used in \cite{hernandez2015probabilistic}. Figure 3 (Supplementary Material, Section B.1.5) shows the performance of DGMEs compared to the baselines, where we observe that DGMEs outperform MDNs and obtain comparable results to MCD and DEs.
\paragraph{Case 2 - Heavy-tailed Noise:}
We set $p_u=0$ and assume that the noise is distributed according to a zero-mean Student-t distribution with $\nu=3$ degrees of freedom and a variance of $9$. Figure \ref{fig: sota_comparison_heavy_tailed} shows the histogram of samples from the predictive distribution of both a training and a test input with their corresponding sample (excess) kurtosis. We observe that on the training examples (i.e., purple histograms), only MDNs and DGMEs are able to learn the heavy-tailedness of the noise, as both MCD and DEs obtain a kurtosis close to 0. On the test example, the baseline approaches are unable to learn the tail behavior, whereas DGMEs capture the heavy-tailedness best, giving the largest kurtosis.
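
As a worked check of this setting, assume the noise is a scaled Student-t variable, $\epsilon_n = s\, t_n$, where $t_n$ follows a standard Student-t distribution with $\nu$ degrees of freedom. The text does not spell out the parameterization, so the scale $s$ is a symbol introduced here. Since $\mathrm{Var}(t_n) = \nu/(\nu-2)$ for $\nu > 2$,
\begin{equation*}
    \mathrm{Var}(\epsilon_n) = s^2 \, \frac{\nu}{\nu-2} = 3 s^2 = 9 \quad \Longrightarrow \quad s = \sqrt{3}.
\end{equation*}
The excess kurtosis of a Student-t distribution is $6/(\nu-4)$ for $\nu > 4$ and infinite for $2 < \nu \leq 4$. For $\nu=3$ the population excess kurtosis is therefore infinite, whereas a Gaussian has excess kurtosis $0$, so a large sample kurtosis is the expected signature of a model that has captured the tails.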
\paragraph{Case 3 - Bimodal Gaussian Noise:}
We set $p_u=0.3$ and assume that the noise is zero-mean and Gaussian distributed with a variance of $9$. For this example, we only compare the mixture-based approaches assuming $K=2$ components. Figure \ref{fig: sota_comparison_bimodal} shows the predictive density for the corresponding 99\% credible interval for each mixture in each approach, where for DGMEs we also show the learned mixture weights of each component. We observe that only MDNs and DGMEs are able to capture the bimodality of the data, with DGMEs also accurately capturing the mixture weight proportions. DEs instead overestimate the heteroscedastic variance in each network. This is because DEs train each ensemble member independently under the assumption of a Gaussian likelihood. We also show that DGMEs can robustly estimate this bimodality, even if the assumed number of mixture components is larger than 2 (see Supplementary Material, Section B.1.3).
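
The conditional density implied by the toy model in this setting can be written explicitly, which may help when reading the learned weights. Since $u_n=-1$ with probability $p_u=0.3$ and the noise is Gaussian with variance $9$,
\begin{equation*}
    p(y_n \mid x_n) = 0.3 \, \mathcal{N}\left(y_n ; -x_n^3, 9\right) + 0.7 \, \mathcal{N}\left(y_n ; x_n^3, 9\right).
\end{equation*}
This is a two-component Gaussian mixture with equal component variances, and its weights agree with the ground-truth values quoted in Figure \ref{fig: sota_comparison_bimodal}; which component is labeled first is a convention assumed here.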


\subsection{Regression on Real Datasets}
We evaluate the performance of DGMEs in regression against MDNs, MCD and DEs on a set of UCI regression benchmark datasets \citep{Dua2019UCI}; see Supplementary Material, Section B.2, for further details on the datasets. 
We follow the experimental setup of \cite{hernandez2015probabilistic}, with each dataset split into 20 train-test folds. We use the same network architecture across each dataset: an MLP with a single hidden layer and ReLU activations, containing 50 hidden units. For each dataset we train for $E=40$ total epochs with a batch size of 32 and a learning rate of $\eta=0.001$. To be consistent with previous evaluations, we use $K=5$ networks in our ensemble and provide results for DGMEs for different numbers of EM steps $J\in\{1, 2, 5, 10\}$. Our results are shown in Tables \ref{tab:regression_experiments_RMSE} and \ref{tab:regression_experiments_NLL}, where we evaluate the root-mean-squared error (RMSE) and the negative log-likelihood (NLL) on the test set averaged over the different folds, respectively. In the same tables, we also report the results for MDNs, MCD and DEs. Experimental results for MCD and DEs can also be found in their respective papers \citep{gal2016dropout, lakshminarayanan2017de}. Note that in this experiment, we do not apply dropout to MDNs and DGMEs and only account for the uncertainty obtained from training the models to minimize the NLL of the samples according to the Gaussian mixture assumption in \eqref{eq: mixture_of_gaussains_assumptions}.
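
For reference, one common way to compute these two metrics for a Gaussian mixture predictive distribution is sketched below. It assumes, as an illustration not confirmed by the text, that the point prediction is the mixture mean and that both metrics are averaged over the $N_{\mathrm{test}}$ test points of a fold; $\hat{y}_n$, $\mu_k$, $\sigma_k^2$, and $N_{\mathrm{test}}$ are shorthand introduced here.
\begin{equation*}
\begin{aligned}
    \hat{y}_n &= \sum_{k=1}^K \pi_k \, \mu_k(x_n), \\
    \mathrm{RMSE} &= \sqrt{\frac{1}{N_{\mathrm{test}}} \sum_{n=1}^{N_{\mathrm{test}}} \left( y_n - \hat{y}_n \right)^2}, \\
    \mathrm{NLL} &= -\frac{1}{N_{\mathrm{test}}} \sum_{n=1}^{N_{\mathrm{test}}} \log \sum_{k=1}^K \pi_k \, \mathcal{N}\left( y_n ; \mu_k(x_n), \sigma_k^2(x_n) \right).
\end{aligned}
\end{equation*}
Under these definitions the RMSE depends only on the mixture mean, while the NLL is sensitive to the full shape of the predictive density, which is why the two tables can rank methods differently.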

We observe that in this experiment DGMEs obtain competitive (or better) performance with respect to the baseline methods. For certain datasets, increasing the number of EM steps greatly improves the performance (e.g., Concrete, Energy, Power Plant, and Yacht). This is not true for all datasets: for example, for the Boston housing dataset, increasing the number of EM steps beyond $J=2$ begins to degrade the performance of the model in terms of NLL. We emphasize that performance can be further improved by incorporating dropout in the training procedure, where the dropout probability $p_d$ can be selected using cross-validation on each train-test split.


\subsection{Financial Time Series Forecasting}
For the final experiment, we focus on the task of one-step-ahead forecasting for financial time series. In particular, using historical daily price data from Yahoo Finance\footnote{\url{https://finance.yahoo.com/}}, we formulate a one-step-ahead forecasting problem using a long short-term memory (LSTM) network \citep{LSTMhochreiter}. The input to the network is a time series that represents the closing price of a particular stock over the past 30 trading days. The target output is the next trading day's closing price. We assess the performance of the model using two metrics: (1) the NLL on the test set, and (2) the RMSE on the test set. We evaluate each method on three different datasets:
\begin{itemize}
    \item {\bf GOOG - stable market regime:} We use training data from the Google (GOOG) stock from the period of Jan 2019 - July 2022 and test on GOOG stock data from the period of August 2022 - January 2023. 
    \item {\bf RCL - market shock regime:} We use training data from the Royal Caribbean (RCL) stock from the period of Jan 2019 - April 2020 and test on RCL stock data from the period of May 2020 - September 2020. 
    \item {\bf GME - high volatility regime:} We use the training data from the GameStop (GME) stock during the ``bubble'' period of Nov 2020 - Jan 2022 and test on GME stock data following that period.
\end{itemize}
We ran each of the previously tested baselines and DGMEs on the three scenarios described above. Additionally, we test the MultiSWAG approach of \cite{NEURIPS2020MultiSWAG}, due to its effectiveness in quantifying epistemic uncertainty, which is of particular importance for the market shock regime \citep{chandra2021bayesian}. We train each model on each dataset for 5 independent runs and report the mean and standard error of the test RMSE and the test NLL in Tables \ref{tab:forecasting_experiments_RMSE} and \ref{tab:forecasting_experiments_NLL}, respectively, where we have bolded the best performing method in each experiment according to the mean value of the metric. For details on the selection of hyperparameters for each of the methods, we refer the reader to the Supplementary Material, Section B.3.

\begin{table}[t]
\centering
\resizebox{0.5\textwidth}{!}{
\begin{tabular}{@{}clllll@{}}
\toprule
\multicolumn{6}{c}{\textbf{Test RMSE}}                                                                                                          \\ \midrule
Dataset & \multicolumn{1}{c}{MDNs} & \multicolumn{1}{c}{MCD} & \multicolumn{1}{c}{DEs}  & \multicolumn{1}{c}{MultiSWAG} & \multicolumn{1}{c}{DGMEs} \\ \midrule
GOOG    & $2.74 \pm 0.06$         & $3.86 \pm 0.16$         & $2.73 \pm 0.03$          & $2.71 \pm 0.05$  & ${\bf 2.71 \pm 0.04}$               \\
RCL     & $15.01 \pm 4.71$        & $16.19 \pm 10.18$       & $14.92 \pm 1.44$              & ${\bf 11.73 \pm 0.45}$   & $14.49 \pm 2.73$       \\
GME     & $11.14 \pm 7.75$        & $2.70 \pm 0.47$         & $3.21\pm0.46$                    & ${\bf 2.00 \pm 0.06}$    & $3.19 \pm 0.33$     \\ \bottomrule
\end{tabular}}
\caption{Average RMSE of the test examples for the financial forecasting experiment.}
\label{tab:forecasting_experiments_RMSE}
\end{table}


\begin{table}[t]
\centering
\resizebox{0.5\textwidth}{!}{\begin{tabular}{@{}cccccc@{}}
\toprule
\multicolumn{6}{c}{\textbf{Test NLL}}                                                                           \\ \midrule
Dataset & MDNs               & MCD             & DEs                      & MultiSWAG    & DGMEs        \\ \midrule
GOOG    & $2.46 \pm 0.03$   & $2.98 \pm 0.01$ & $2.44 \pm 0.01$ & ${2.54 \pm 0.00}$ & ${\bf 2.43 \pm 0.02}$    \\
RCL     & $18.83 \pm 17.82$ & $6.12 \pm 3.93$ & $5.94 \pm 0.80$  & ${6.21 \pm 0.18}$ & ${\bf 5.00 \pm 0.76}$   \\
GME     & $6.01 \pm 3.85$   & $2.46 \pm 0.16$ & $2.66 \pm 0.13$   & ${\bf 2.14 \pm 0.03}$ & $2.61 \pm 0.31$   \\ \bottomrule
\end{tabular}}
\caption{Average NLL of the test examples for the financial forecasting experiment.}
\label{tab:forecasting_experiments_NLL}
\end{table}

The results indicate that for the GOOG dataset, DGMEs achieve, on average, the best NLL and RMSE. In the case of the RCL dataset, we observe an interesting result: DGMEs attain the best performance in terms of NLL, but MultiSWAG does best in terms of RMSE. We believe DGMEs outperform in terms of NLL because the likelihood function assumed by DGMEs is a true Gaussian mixture, while the MultiSWAG approach applies stochastic weight averaging Gaussian (SWAG) independently on multiple networks under the Gaussian likelihood assumption. This gives DGMEs the advantage in terms of learning the complex nature of the RCL dataset. On the other hand, since MultiSWAG accounts for uncertainty using SWAG, it appears to make model training more stable (hence the smaller standard error on each of the metrics) and to better account for epistemic uncertainty. This could possibly explain why its RMSE is lower than that of DGMEs, with a smaller standard error. For the GME dataset, MultiSWAG outperforms DGMEs consistently, and with tighter standard error bars. As a final remark, we emphasize that in our paper we have accounted for epistemic uncertainty in DGMEs using dropout, but other methods could have been used (such as variational inference, Laplace approximation, or SWAG). Based on the results of this experiment, we highlight the possibility of incorporating SWAG in the training of DGMEs as a better way to account for epistemic uncertainty (as opposed to dropout).

\section{Conclusions}
\label{conclusions}

This paper proposes DGMEs, a novel probabilistic DL ensemble method for jointly quantifying epistemic and aleatoric uncertainty. Unlike deep ensembling, DGMEs optimize the data likelihood directly and are able to capture complex behavior in the predictive distribution (e.g., heavy-tailedness and multimodality) by modeling the conditional distribution of the data as a Gaussian mixture. Our experiments show that DGMEs can capture more complex distributional properties than a variety of probabilistic DL baselines in regression settings and obtain competitive performance on detecting OOD samples in classification settings. As next steps, alternative mechanisms for handling the epistemic uncertainty can be considered. For example, one can instead form a variational approximation to the posterior of each mixture component, thereby forming a Gaussian mixture approximation to the posterior of the ensemble parameters. Additionally, a more thorough analysis of the classification setting can be considered. Rather than using a mixture of categorical distributions to model the predictive density, one can use a mixture of Dirichlet distributions to account for uncertainty in the class probabilities, in line with the work of \cite{hobbhahn2022fast}. Finally, DGMEs can be applied to improve the efficiency of active learning algorithms and exploration strategies in reinforcement learning.


\textbf{Acknowledgments.}
This paper was prepared for informational purposes by the Artificial Intelligence Research group of JPMorgan Chase \& Co. and its affiliates (``JP Morgan''), and is not a product of the Research Department of JP Morgan. JP Morgan makes no representation and warranty whatsoever and disclaims all liability, for the completeness, accuracy or reliability of the information contained herein. This document is not intended as investment research or investment advice, or a recommendation, offer or solicitation for the purchase or sale of any security, financial instrument, financial product or service, or to be used in any way for evaluating the merits of participating in any transaction, and shall not constitute a solicitation under any jurisdiction or to any person, if such solicitation under such jurisdiction or to such person would be unlawful.

% References
\bibliography{references}
\end{document}
