% \documentclass{uai2023} % for initial submission
\documentclass[accepted]{uai2023} % after acceptance, for a revised
% version; also before submission to
% see how the non-anonymous paper
% would look like
%% There is a class option to choose the math font
% \documentclass[mathfont=ptmx]{uai2023} % ptmx math instead of Computer
% Modern (has noticeable issues)
% \documentclass[mathfont=newtx]{uai2023} % newtx fonts (improves upon
% ptmx; less tested, no support)
% NOTE: Only keep *one* line above as appropriate, as it will be replaced
%       automatically for papers to be published. Do not make any other
%       change above this note for an accepted version.

%% Choose your variant of English; be consistent
\usepackage[american]{babel}
% \usepackage[british]{babel}

%% Some suggested packages, as needed:
\usepackage{natbib} % has a nice set of citation styles and commands
\bibliographystyle{plainnat}
\renewcommand{\bibsection}{\subsubsection*{References}}
\usepackage{mathtools} % amsmath with fixes and additions
% \usepackage{siunitx} % for proper typesetting of numbers and units
\usepackage{booktabs} % commands to create good-looking tables
\usepackage{tikz} % nice language for creating drawings and diagrams
% If you use BibTeX in apalike style, activate the following line:
% \bibliographystyle{apalike}
\usepackage{multirow, booktabs} \usepackage{lipsum}
% packages added by me
\usepackage{subcaption}
% \usepackage[pdftex]{graphicx}
\usepackage[ruled,vlined]{algorithm2e} \usepackage{hyperref}
\usepackage{multirow}
\usepackage[acronym]{glossaries}
\newacronym{elbo}{ELBO}{Evidence Lower Bound}
\newacronym{pdmp}{PDMP}{Piecewise Deterministic Markov Process}
\newacronym{bps}{BPS}{Bouncy Particle Sampler}
\newacronym{sbps}{SBPS}{Stochastic Bouncy Particle Sampler}
\newacronym{esbps}{eSBPS}{Efficient Stochastic Bouncy Particle Sampler}
\newacronym{atsbps}{atSBPS}{Adaptive Thinning Stochastic Bouncy Particle Sampler}
\newacronym{ipp}{IPP}{Inhomogeneous Poisson Process}
\newacronym{vi}{VI}{Variational Inference}
\newacronym{hmc}{HMC}{Hamiltonian Monte Carlo}
\newacronym{sghmc}{SGHMC}{Stochastic Gradient Hamiltonian Monte Carlo}
\newacronym{bnn}{BNN}{Bayesian Neural Network}
\newacronym{cnn}{CNN}{Convolutional Neural Network}
\newacronym{sgld}{SGLD}{Stochastic Gradient Langevin Dynamics}
\newacronym{ood}{OOD}{Out of Distribution}
%
% \usepackage{amsmath}
\usepackage{amsfonts}
% my new commands
\newcommand{\norm}[1]{\left\lVert#1\right\rVert}
\newcommand{\taub}[0]{\tau_{\text{bounce}}} \newcommand{\taur}[0]{\tau_{\text{ref}}}
\newcommand{\tauz}[0]{\mathbf{\tau}_{Z}} \newcommand{\myomega}[0]{\mathbf{\omega}}
\newcommand{\myv}[0]{\mathbf{v}}
\newcommand{\myw}[0]{\mathbf{\omega}}
\newcommand{\sigbps}[0]{$\sigma$BPS }
\usepackage{tabularx}
\newcolumntype{L}[1]{>{\raggedright\let\newline\\\arraybackslash\hspace{0pt}}m{#1}}
\newcolumntype{C}[1]{>{\centering\let\newline\\\arraybackslash\hspace{0pt}}m{#1}}
\newcolumntype{R}[1]{>{\raggedleft\let\newline\\\arraybackslash\hspace{0pt}}m{#1}}

\newlength{\Oldarrayrulewidth}
\newcommand{\Cline}[2]{%
  \noalign{\global\setlength{\Oldarrayrulewidth}{\arrayrulewidth}}%
  \noalign{\global\setlength{\arrayrulewidth}{#1}}\cline{#2}%
  \noalign{\global\setlength{\arrayrulewidth}{\Oldarrayrulewidth}}}

\usepackage{float}


%% Provided macros
% \smaller: Because the class footnote size is essentially LaTeX's \small,
%           redefining \footnotesize, we provide the original \footnotesize
%           using this macro.
%           (Use only sparingly, e.g., in drawings, as it is quite small.)

%% Self-defined macros
\newcommand{\swap}[3][-]{#3#1#2} % just an example

\title{Piecewise Deterministic Markov Processes for Bayesian Neural Networks}

% The standard author block has changed for UAI 2023 to provide
% more space for long author lists and allow for complex affiliations
%
% All author information is automatically removed by the class for the
% anonymous submission version of your paper, so you can already add your
% information below.
%
% Add authors
\author[1]{\href{mailto:<ej.goan@qut.edu.au>?Subject=UAI2023 BNN Paper}{Ethan~Goan}{}}
\author[1]{Dimitri~Perrin}
\author[1]{Kerrie Mengersen}
\author[1]{Clinton Fookes}
% Add affiliations after the authors
\affil[1]{%
  Queensland University of Technology
}

\begin{document}
\maketitle

\begin{abstract}
  Inference on modern Bayesian Neural Networks (BNNs) often relies on a
  variational inference treatment, which imposes assumptions about independence
  and the form of the posterior that are frequently violated. Traditional MCMC
  approaches avoid these assumptions at the cost of increased computation due to
  their incompatibility with subsampling of the likelihood. New Piecewise Deterministic
  Markov Process (PDMP) samplers permit subsampling, though they introduce model-specific inhomogeneous Poisson Processes (IPPs) which are difficult to sample
  from. This work introduces a new generic and adaptive thinning scheme for
  sampling from these IPPs, and demonstrates how this approach can
  accelerate the application of PDMPs for inference in BNNs. Experimentation
  illustrates how inference with these methods is computationally feasible, can
  improve predictive accuracy and MCMC mixing performance, and can provide informative uncertainty measurements when
  compared against other approximate inference schemes.
\end{abstract}

%%%% INTRODUCTION
\section{Introduction}
%
Since \Gls{hmc} was first developed for Bayesian inference
\cite{neal2012bayesian}, sampling methods have seen relatively little
application to \Glspl{bnn}. The flexibility, inference diagnostics and asymptotic
guarantees of HMC come at the cost of computational complexity, as every data
point needs to be used to compute the full likelihood and to perform Metropolis-Hastings corrections. As models and data
sets have grown, this expense has not been offset by the considerable
performance increase in computational hardware. A recent study found that
fitting an HMC model for ResNet20 required a computational cost equivalent to
60 million SGD epochs to obtain only 240 samples from three chains \cite{izmailov2021bayesian}.
%
%
\par
%
%
\begin{figure}[t]
  \centering
  \subfloat[][]{\includegraphics[width=0.5\linewidth]{./figs/pred.pdf}}
  \subfloat[][]{\includegraphics[width=0.5\linewidth]{./figs/corr.pdf}}
  \\
  \subfloat[][]{\includegraphics[width=0.8\linewidth]{./figs/density.pdf}}
  \caption{Example of correlations between the parameters in the first layer of
    a BNN for a simple regression task. Plot (a) shows samples from the predictive posterior
    obtained with the proposed method, (b) the correlation between all parameters in the first
    layer, and (c) a kernel density estimate for a single parameter.}
  \label{fig:corr}
\end{figure}
%
%
%
To circumvent the computational expense, much research has explored the
application of approximate inference through the lens of \Gls{vi}
\cite{jordan1999introduction, wainwright2008graphical, blei2017variational} or
through exploiting properties of SGD \cite{mandt2017stochastic}. \Gls{vi}
replaces the true target distribution with an approximate distribution that can
be easily manipulated, typically using a mean-field approach
where independence between parameters is assumed. These methods are attractive due to
their reduced computational complexity and their amenability to stochastic
optimisation. However, the suitability of these methods relies heavily on the
expressiveness of the approximate posterior to accurately model the true
distribution. Given the known correlations and frequent multi-modal structure
amongst parameters within BNNs \cite{barber1998ensemble, mackay1995probable}, a
mean-field approximation can be unsuitable for accurate inference. Figure
\ref{fig:corr} illustrates these properties for a simple BNN. Stochastic gradient MCMC methods such as
\Gls{sgld} aim to address this issue, but require prohibitively small and decreasing learning
rates to target the posterior, which limits their applicability~\cite{nagapetyan2017true}.
%
%
\par
%
%
This work explores a new set of ``exact'' inference methods based on
\Glspl{pdmp} \cite{davis1984piecewise} to perform Bayesian inference. \Gls{pdmp} methods can maintain the
true posterior as their invariant distribution during inference whilst permitting
sub-sampling of the likelihood at each update. This property is attractive for
\Glspl{bnn}, which are typically large in terms of both parameters and
data sets. Furthermore, previous research has highlighted the favourable
performance of PDMP methods in terms of mixing and sampling efficiency
\cite{bouchard2018bouncy, bierkens2019zig, wu2017generalized,
  bierkens2020boomerang}. The dynamics of these samplers are simple to simulate, though the times
at which these dynamics are updated are governed by an \Gls{ipp}, which can be difficult
to sample from. This work explores an adaptive procedure to approximate
sampling of these event times to allow for approximate inference within the
context of \Glspl{bnn}. The contributions of this paper are as follows:
%
%
\begin{itemize}
  \item Propose a novel adaptive thinning method for approximate sampling of
    events from \Glspl{ipp};
  \item Develop a GPU-accelerated package for applying these methods to
    general models;
  \item Evaluate the performance of these methods applied to computer vision
    tasks using \Glspl{bnn};
  \item Evaluate the suitability of \Gls{pdmp} samplers for
    \Glspl{bnn} and investigate how they can improve predictive accuracy, calibration and
    posterior exploration when compared against \Gls{sgld}.
\end{itemize}
%
%
MCMC
methods have often been seen as computationally prohibitive for models with
many parameters or where modern large data sets are used. It is hoped
that this work will demonstrate that approximate inference using MCMC
approaches for BNNs can be practically feasible and offer insightful
results, and will show how exact methods can be leveraged for approximate
inference to more accurately target posterior distributions in \Glspl{bnn}.
%
%
%
%
%%%%%% PRELIMINARIES
\section{Preliminaries}
\label{sec:preliminaries}
%
%
%
Following the description from \cite{fearnhead2018piecewise}, \Glspl{pdmp} are
defined by three key components: piecewise deterministic dynamics, an event rate,
and a transition kernel. For inference, the goal is to design these three
components such that we can use the properties of a \Gls{pdmp} to sample from the
posterior distribution of our parameters $\mathbf{\omega}$. We represent the
deterministic dynamics as $\Psi(\myw, \myv, t)$, where $t$ represents time and $\myv$ is an auxiliary
velocity variable with known distribution $\Phi(\myv)$ that guides posterior exploration. At random event times,
these dynamics are updated in accordance with the specified transition kernel,
and the state $\myw$ at the time of the update event serves as the starting
position for the next segment, so that consecutive segments are connected.
%
%
\par
%
%
%

An \Gls{ipp} with rate function $\lambda(\mathbf{\omega}(t), \mathbf{v}(t))$ governs the
update times for the dynamics. All rate functions in this work rely upon the
negative joint log probability of the model,
\begin{equation}
  \label{eq:energy}
  U(\omega) = -\log \Big(p(\omega) p(\mathcal{D}|\omega)\Big),
\end{equation}
%
where $p(\omega)$ is a prior or reference measure and $p(\mathcal{D}|\omega)$ is our likelihood.
If these three components are suitably defined,
these processes can sample from a given posterior distribution. For derivations on how
to design these components to target a posterior distribution, the reader can
refer to \cite{fearnhead2018piecewise, vanetti2017piecewise, davis1993markov}. We now introduce the
samplers used within this work.
%
%
\par
%
%
\subsection{Bouncy Particle Sampler}
\label{sec:bps}
%
%
%
The dynamics of the \Gls{bps} \cite{bouchard2018bouncy} are given
by $\Psi(\myw, \myv, t) = \myw^{i} + \myv^{i} t$, where the superscripts index the
deterministic segment. The velocity remains constant within these segments and the
parameter space is explored linearly. The velocity is updated at event times
given by $ \tau \sim \text{IPP}\big( \lambda(\mathbf{\omega}(t), \mathbf{v})\big)$, where
\begin{equation}
  \label{eq:bps_rate}
  \lambda(\mathbf{\omega}(t), \mathbf{v}) = \max \{0, \nabla U(\mathbf{\omega}(t)) \cdot \mathbf{v}^{i} \}.
\end{equation}
%
%
%
Once an event time is sampled, the velocity ``bounces'' according
to an elastic (lossless) Newtonian collision,
\begin{equation}
  \label{eq:bps_bounce}
  \mathbf{v}^{i+1}= \mathbf{v}^{i} - 2 \dfrac{ \nabla U(\mathbf{\omega}^{i+1}) \cdot \mathbf{v}^{i}}{ \norm{\nabla U(\mathbf{\omega}^{i+1}) }^{2}} \nabla U(\mathbf{\omega}^{i+1}),
\end{equation}
%
%
where $\myw^{i+1}$ represents the end of the previous segment at time $\tau$, and serves as the starting position for the following segment.
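%
%
As a brief check of this update, the following sketch follows directly from the bounce above, writing $g = \nabla U(\myw^{i+1})$ as shorthand introduced only for this illustration,
\begin{align*}
  g \cdot \myv^{i+1} &= g \cdot \myv^{i} - 2 \dfrac{g \cdot \myv^{i}}{\norm{g}^{2}} \norm{g}^{2} = - g \cdot \myv^{i}, \\
  \norm{\myv^{i+1}}^{2} &= \norm{\myv^{i}}^{2} - 4 \dfrac{(g \cdot \myv^{i})^{2}}{\norm{g}^{2}} + 4 \dfrac{(g \cdot \myv^{i})^{2}}{\norm{g}^{2}} = \norm{\myv^{i}}^{2}.
\end{align*}
The component of the velocity along the gradient therefore changes sign while the speed is preserved, so the event rate immediately after a bounce is zero.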
%
%
The \Gls{bps} provides linear dynamics that are simple to simulate,
though it relies only on local gradient information, which can lead to inefficient
exploration for \Glspl{bnn}. Preconditioning can allow us to address this.
%
%
\subsection{Preconditioned BPS}
To accelerate posterior exploration in directions of interest, we can
precondition the gradients to include more information about the structure of our
posterior space. Introduction of a
preconditioning matrix $A$ results in new dynamics of
$\Psi(\myw, \myv, t) = \myw^{i} + A\myv^{i} t$, and a new event rate, $\lambda(\mathbf{\omega}(t), \mathbf{v}) = \max\{0, \mathbf{v}^{i} \cdot A \nabla U(\mathbf{\omega}^{i} + A\mathbf{v}^{i}t) \}$.
% \end{equation}
%
%
Upon events, the velocity is updated according to,
%
%
\begin{equation}
  \label{eq:pbps_bounce}
  \mathbf{v}^{i+1}= \mathbf{v}^{i} - 2 \dfrac{A \nabla U(\mathbf{\omega}^{i+1}) \cdot \mathbf{v}^{i}}{ \norm{A\nabla U(\mathbf{\omega}^{i+1}) }^{2}} A\nabla U(\mathbf{\omega}^{i+1}).
\end{equation}
%
%
%
With careful choice of $A$, exploration along certain axes can be appropriately
scaled. \cite{pakman2017stochastic} propose a preconditioner similar to
\cite{li2015preconditioned}, though our preliminary experimentation found
inconsistent results when applied to \Glspl{bnn}. Instead, we opt to build on the
approach of \cite{bertazzi2020adaptive}, where we use variance information of
our samples to precondition our dynamics. We choose the preconditioner such
that $A =\text{diag}\big(\Sigma^{\frac{1}{2}}\big)$, where $\Sigma$ is the
covariance of our samples estimated during a warm-up period. As such, we refer to this
sampler as the \sigbps.

\subsection{Boomerang Sampler}
The Boomerang Sampler \cite{bierkens2020boomerang} introduces non-linear
dynamics for both parameter and velocity terms, and the inclusion of a Gaussian
reference measure for the parameters and
velocity $ \mathcal{N} (\myw_{\star}, \Sigma_{\star}) \otimes \mathcal{N}(0, \Sigma_{\star})$. The first term in this reference
measure can be seen as a replacement for the prior in the joint probability over
parameters and the second as the known distribution for the velocity component.
The parameters $\myw_{\star}$ and $\Sigma_{\star}$ can be specified as a traditional prior, or
can be specified in an empirical approach where they are learnt from the data.
Within this work, we will set $\myw_{\star}$ to the MAP estimate. In the original
paper, $\Sigma_{\star}$ is set to the inverse of the Hessian; however, this can be
computationally prohibitive for \Glspl{bnn}. Instead, we sum over the first order
gradients at $\myw_{\star}$, and then compute the derivative of this sum, which is then
inverted and scaled such that,
%
\begin{equation}
  \label{eq:boomerang_sigma}
  \Sigma_{\star } = \gamma \Big[\sum_{i=0}^{N-1} \nabla_{\myw} \sum_{j=0}^{P-1}\nabla_{\myw} p(\mathcal{D}_{i}| \myw_{\star})_{j} \Big] ^{-1}
\end{equation}
%
where $N$ is the number of mini-batches present, $P$ is the number of parameters, subscript $j$ indicates summation over parameter gradients in our model and $\gamma$ is a hyperparameter to
adjust the scale as needed. This can be seen as a weighted stochastic approximation to
the inverse of the diagonal of the Hessian matrix.
%----
\par
%----
Unlike the \Gls{bps} samplers, the velocity does not
remain constant between events. The dynamics of the Boomerang sampler for
$\myw$ and $\myv$ between events are given
by
$\Psi(\myw, \myv, t)_{\myw} = \myw_{\star} + (\myw^{i} - \myw_{\star}) \cos(t) + \myv^{i}\sin (t)$,
$\Psi(\myw, \myv, t)_{\myv}= -(\myw^{i} - \myw_{\star}) \sin (t) + \myv^{i}\cos (t)$,
where the subscripts denote the parameter and velocity trajectory within the
deterministic segment, so that the trajectory starts from $(\myw^{i}, \myv^{i})$ at $t = 0$. The event rate is the same as for the \Gls{bps}, and the
starting velocity for the next segment is updated upon events as,
\begin{equation}
  \label{eq:boomerang_bounce}
  \mathbf{v}^{i+1}= \mathbf{v}^{i} - 2 \dfrac{\nabla U(\mathbf{\omega}^{i+1}) \cdot \mathbf{v}^{i}}{ \norm{\Sigma_{\star}^{\frac{1}{2}}\nabla U(\mathbf{\omega}^{i+1}) }^{2}} \Sigma_{\star}\nabla U(\mathbf{\omega}^{i+1}).
\end{equation}
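%
%
The deterministic Boomerang dynamics between events can also be read as the flow of a simple linear system. The following is a short sketch of this view in the notation of this section, following the construction of \cite{bierkens2020boomerang}; the symbol $H$ is shorthand introduced only for this illustration. Within a segment the state evolves according to
\begin{equation*}
  \frac{d\myw}{dt} = \myv, \qquad \frac{d\myv}{dt} = -(\myw - \myw_{\star}),
\end{equation*}
and, defining
\begin{equation*}
  H(\myw, \myv) = \tfrac{1}{2}(\myw - \myw_{\star})^{\top}\Sigma_{\star}^{-1}(\myw - \myw_{\star}) + \tfrac{1}{2}\myv^{\top}\Sigma_{\star}^{-1}\myv,
\end{equation*}
the symmetry of $\Sigma_{\star}^{-1}$ gives
\begin{equation*}
  \frac{dH}{dt} = (\myw - \myw_{\star})^{\top}\Sigma_{\star}^{-1}\myv - \myv^{\top}\Sigma_{\star}^{-1}(\myw - \myw_{\star}) = 0.
\end{equation*}
Since the density of the Gaussian reference measure is proportional to $\exp(-H)$, it remains constant along each deterministic segment, which is why only the remaining part of the target enters the event rate.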
%
%
%
\subsection{Velocity Refreshment}
\label{sec:refresh}
All of the samplers introduced fail to target the posterior explicitly when using
the above dynamics alone. Introducing a refreshment step, governed by a
homogeneous Poisson process $\taur \sim \text{PP}(\lambda_{\text{ref}})$, rectifies this.
When $\taur < \tau$, the velocity is instead randomly sampled from the known
reference distribution $\Phi(\myv)$, and $\taur$ is used for the update event
time. For BPS samplers in this work, we use a refreshment distribution of the
form $\mathcal{N}(0, \sigma^{2})$, where $\sigma$ is a hyper-parameter to be set, and the Boomerang
sampler requires $\Phi(\myv) = \mathcal{N}(0, \Sigma_{\star})$. A summary of PDMP algorithms for
inference is given in Algorithm \ref{alg:pdmp}.
%
\begin{algorithm}
  \SetAlgoLined \KwResult{Samples from posterior distribution}
  \While{Sampling}{
    \tcp{Simulate event time}
    \tcp{event times in this work simulated with Algorithm \ref{alg:ipp}}
    $\tau \sim \text{PP}(\lambda(\myw, \myv))$\;
    \tcp{Simulate time of refresh event}
    $\taur \sim \text{PP}(\lambda_{\text{ref}})$\;
    $\tau^i = \text{min}(\tau, \taur)$\;
    \tcp{find end of current piecewise-deterministic segment, which will form start of next segment}
    $\omega^{i+1}  = \Psi(\myw, \myv, \tau^{i})_{\myw}$\;
    \eIf{$\tau^{i} = \tau$}{ \tcp{update according to kernel}
      $\mathbf{v}^{i+1} = R(\mathbf{\omega}^{i+1}, \mathbf{v}^{i})$\; }{
      \tcp{refresh velocity from known distribution}
      $\mathbf{v}^{i+1} \sim \Phi(\mathbf{v})$\; } }
  \caption{Application of PDMP samplers for Inference}
  \label{alg:pdmp}
\end{algorithm}
%
%
%
%
\subsection{Problems with the Event Rate}
%
%
%
With the deterministic dynamics illustrated in these samplers, the main challenge in
implementing these methods is the sampling of the event times.
Analytic sampling from $\text{IPP}\big(\lambda(t)\big)$ requires being able to invert
the integral of the event rate w.r.t. time,
%
%
\begin{equation}
  \label{eq:event_analytic}
  \Lambda(\tau) = \int_0^\tau \lambda(t) dt = \int_0^\tau \max\{0, \mathbf{v} \cdot A \nabla U(\mathbf{\omega}(\myv, t)) \} dt,
\end{equation}
%
%
where $A = \mathbf{I}$ for the \Gls{bps} and Boomerang samplers.
Inverting the above integral is feasible only for simple models. A general approach
for sampling from IPPs is available through thinning \cite{lewis1979simulation}.
This requires introducing an additional rate function $\mu(t)$ that we can
sample from and that is also an upper bound on the event rate function of
interest, such that $\mu(t) \geq \lambda(t)$ for all $t \geq 0$.
%
%
\par
%
% NOTE: This figure is really for the next section, but I wanted it to appear at the top of
% that page, so have included it here, as otherwise would go over to the following page
%
\begin{figure*}[!htpb]
  % \centering
  \begin{subfigure}[t]{0.25\textwidth}
    \centering \includegraphics[width=1\linewidth]{./figs/interpolation_0.pdf}
  \end{subfigure}%
  %
  ~
  \begin{subfigure}[t]{0.25\textwidth}
    \centering \includegraphics[width=1\linewidth]{./figs/interpolation_1.pdf}
  \end{subfigure}
  % ~
  \begin{subfigure}[t]{0.25\textwidth}
    \centering \includegraphics[width=1\linewidth]{./figs/interpolation_2.pdf}
  \end{subfigure}%
  % ~
  \begin{subfigure}[t]{0.25\textwidth}
    \centering \includegraphics[width=1\linewidth]{./figs/interpolation_3.pdf}
  \end{subfigure}
  \caption{Example of the progression of the proposed envelope scheme used for
    thinning. The blue line represents the true event rate, orange section
    depicts the active regions for which we sample a new proposal time, and the
    red section depicts previous segments in the envelope. Starting from the
    left, an initial segment is found through interpolation between time
    points $t_{0}$ and $t_{init}$. In the next segment, active regions of the
    envelope are found by interpolating between the two prior points, which
    extends to create a new segment to propose times. This process continues
    until a proposed time is accepted from thinning.}
  \label{fig:envelope}
\end{figure*}
%
%
%
The efficiency of any thinning scheme relies on the tightness of the upper
bound; the greater the difference between the upper bound and the true rate, the
more likely a proposed time will be rejected when sampling.
\cite{pakman2017stochastic} propose a Bayesian linear regression method to
generate an upper bound suitable for thinning, though this requires calculating the
variance of the gradients to formulate a suitable upper bound. They calculate
this variance empirically, which requires computing the gradient for each data
point individually within a mini-batch. This computation is prohibitive for \Glspl{bnn},
where automatic differentiation
software is used. Furthermore, the solution to the regression requires a matrix
inversion, which can be numerically unstable without a strong prior and limits
its application for accelerating sampling within larger models. In the next
section, we address these issues by instead introducing an interpolation-based
scheme for creating efficient and adaptive approximate upper bounds that avoids excessive gradient
computations and the numerical instability of matrix inversion.
%
%
%
%
%%%%%%  ADAPTIVE
%
\section{Adaptive Bounds for Samplers}
%
%
%
%
\subsection{Sampling from IPPs with Linear Event Rates}
%
%
Our goal is to create a piecewise-linear envelope $h(t)$ that will serve as an
approximate upper bound of our true event rate, where each segment in $h(t)$ is
represented by $a_{i}t + b_{i}$. This envelope will serve as the event rate for
a proposal IPP that will be suitable for use with the
thinning method of \cite{lewis1979simulation}. A proposed event time $t$
is accepted if,
\begin{equation}
  \label{eq:thinning}
  U \le \frac{\lambda(t)}{h(t)},
\end{equation}
where $U \sim \text{Uniform}[0, 1]$. We begin by
building on the work of \cite{klein1984time} to demonstrate how to sample times
from an IPP with a piecewise-linear event rate which we can use with thinning.
%
%
\par
%
%
Within our proposal \Gls{ipp} with rate $h(t)$, we wish to generate the next event time $t_{i}$
given the previous event $t_{i-1}$. The probability that the next event occurs
within the range $[t_{i-1}, t_{i}]$ is given by \cite{devroye2006nonuniform},
\begin{equation}
  \label{eq:2}
  F(t_{i}) = 1 - \text{exp}\{-(\Lambda(t_{i}) - \Lambda(t_{i-1}) )\},
\end{equation}
%
%
where $\Lambda(t) = \int_{0}^{t} h(s)\, ds$ is the integrated rate of the proposal process. Setting $F(t_{i}) = U$ and solving for $t_{i}$ gives,
\begin{equation}
  \label{eq:3}
  t_{i} = \Lambda^{-1}\big(\Lambda(t_{i-1}) - \log(1 - U)\big),
\end{equation}
where $U \sim \text{Uniform}[0,1]$. For linear segments, the solution to this
system can be written as \cite{klein1984time},
\begin{equation}
  \label{eq:linear_time}
  t_{i} = \big(-b_{i} + \sqrt{b_{i}^{2} +  a_{i}^{2}t_{i-1}^{2} + 2a_{i}b_{i}t_{i-1} - 2a_{i}\log(1 - U)}\big ) /a_{i}.
\end{equation}
%
%
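For completeness, the closed form for a single linear segment can be recovered in a few steps. This is a sketch assuming a rate of $a_{i}t + b_{i}$ with $a_{i} \neq 0$ on the segment, and writing $E = -\log(1 - U)$ as shorthand introduced only for this derivation. The integrated rate between the two times must equal $E$,
\begin{equation*}
  \int_{t_{i-1}}^{t_{i}} (a_{i}s + b_{i})\, ds = \frac{a_{i}}{2}\big(t_{i}^{2} - t_{i-1}^{2}\big) + b_{i}\big(t_{i} - t_{i-1}\big) = E.
\end{equation*}
Multiplying by $2a_{i}$ and completing the square gives
\begin{equation*}
  (a_{i}t_{i} + b_{i})^{2} = (a_{i}t_{i-1} + b_{i})^{2} + 2a_{i}E,
\end{equation*}
and taking the root for which the rate $a_{i}t_{i} + b_{i}$ is non-negative yields
\begin{equation*}
  t_{i} = \frac{-b_{i} + \sqrt{(a_{i}t_{i-1} + b_{i})^{2} - 2a_{i}\log(1 - U)}}{a_{i}}.
\end{equation*}
As a small numerical check with illustrative values $a_{i} = 2$, $b_{i} = 1$, $t_{i-1} = 0$ and $E = 2$, this gives $t_{i} = (-1 + \sqrt{1 + 8})/2 = 1$, and indeed $\int_{0}^{1} (2s + 1)\, ds = 2$.
%
%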
This provides a framework for sampling from \Glspl{ipp} with a linear event
rate. We now describe how we create a piecewise-linear envelope for a proposal
process that can be used for thinning.
%
%
\subsection{Piecewise Interpolation for Event Thinning}
\label{sec:interpolation}
%
%
%
We begin by introducing a modified event rate for which we will form our envelope,
%
%
\begin{equation}
  \label{eq:rate_adjust}
  \hat{\lambda}(\mathbf{\omega}(t), \mathbf{v}) = \max \{0, \alpha\nabla U(\mathbf{\omega}) \cdot \mathbf{v}^{i} \},
\end{equation}
%
%
where $\alpha \geq 1$ is a scaling
factor to control the tightness of the approximate bound on the rate. The use of $\hat{\lambda}$ for
creating our envelope is valid, since for values
of $\alpha \geq 1$, $\hat{\lambda}(t) \geq \lambda(t)$. The scaling factor included in this
event rate is designed to provide flexibility to end users with respect to
computational time and the bias that will be introduced during inference. The closer
$\alpha$ is to one, the lower the probability of rejecting proposed event
times, but the greater the probability that the generated envelope will not be
a strict upper bound.
%
%
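To make the role of $\alpha$ concrete, the following short observation may help. It follows from Equations \ref{eq:bps_rate} and \ref{eq:rate_adjust} alone and is offered as an illustration rather than a result used elsewhere in this work. Because $\alpha$ is positive, it can be taken outside the maximum,
\begin{equation*}
  \hat{\lambda}(\mathbf{\omega}(t), \mathbf{v}) = \max \{0, \alpha\nabla U(\mathbf{\omega}) \cdot \mathbf{v}^{i} \} = \alpha \max \{0, \nabla U(\mathbf{\omega}) \cdot \mathbf{v}^{i} \} = \alpha \lambda(\mathbf{\omega}(t), \mathbf{v}).
\end{equation*}
In the idealised case where an envelope matched $\hat{\lambda}$ exactly, a proposal would be accepted with probability $\lambda / \hat{\lambda} = 1/\alpha$ wherever $\lambda > 0$; for example, $\alpha = 1.25$ would correspond to accepting $80\%$ of proposals. In practice the interpolated envelope only approximates $\hat{\lambda}$, so this should be read as a rough guide to the trade-off.
%
%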
\par
%
%
Our goal is to create a piecewise-linear upper bound suitable for proposing
event times using Equation \ref{eq:linear_time}. To achieve this we maintain two growing sets, one for proposed
event times $T = \{t_{0}, \dots, t_{n}\}$ and one for the values of the adjusted event rate
at these times $L = \{\hat{\lambda}(t_{0}), \dots, \hat{\lambda}(t_{n})\}$, from which we can
create a set of functions,
\begin{equation}
  \label{eq:h(t)}
  h(t) =  a_{i}t + b_{i}, \hspace{0.5cm} t \ge t_{i}.
\end{equation}
%
The values for $a_{i}$ and $b_{i}$ are found by interpolating between
the points $(t_{i-1}, \hat{\lambda}(t_{i-1}))$ and $(t_{i}, \hat{\lambda}(t_{i}))$.
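%
%
Written out, and assuming standard two-point linear interpolation (the explicit form is implied rather than stated in the description above), the coefficients are
\begin{equation*}
  a_{i} = \frac{\hat{\lambda}(t_{i}) - \hat{\lambda}(t_{i-1})}{t_{i} - t_{i-1}}, \qquad
  b_{i} = \hat{\lambda}(t_{i}) - a_{i}t_{i}.
\end{equation*}
The resulting line passes through both points, since
\begin{equation*}
  a_{i}t_{i-1} + b_{i} = \hat{\lambda}(t_{i}) - a_{i}(t_{i} - t_{i-1}) = \hat{\lambda}(t_{i-1}),
\end{equation*}
and for $t \ge t_{i}$ it extrapolates the most recent trend in the adjusted rate. For the initial segment, the pair of evaluation times used to start the envelope plays the same role as $(t_{i-1}, t_{i})$.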
%
%
\par
%
%
At the beginning of every deterministic \Gls{pdmp} segment, the sets $T$ and $L$
will be empty. To initialise the sets and create our first linear segment, we
evaluate the event rate at two points, $t_{0}=0$ and $t=t_{init}$,
where $t_{init} > t_{0}$. To evaluate the values for $a_{0}$ and $b_{0}$, we
interpolate between these two points. Once the values for the first linear
segment are found, $t_{0}$ and $\hat{\lambda}(t_{0})$ are appended to their
corresponding sets, and $t_{init}$ and $\hat{\lambda}(t_{init})$ are discarded. With
this initial linear segment, we can propose a time $t_{i}$ through Equation
\ref{eq:linear_time}. This proposed time is either accepted or rejected according to
Equation \ref{eq:thinning}.
%
%
\par
%
%
If the proposed time is accepted, then the dynamics of the \Gls{pdmp} sampler
are updated at the given event time and the sets $T$ and $L$ are cleared, ready
to be re-initialised for the new dynamics. If the time is rejected, the proposed
time $t_{i}$ and envelope evaluation $\hat{\lambda}(t_{i})$ are appended to their
respective sets, and a new linear segment is calculated to interpolate between
this rejected proposal and the previous elements in the sets $T$ and $L$. The
rejected proposal time will serve as the new starting point $(t_{i-1})$ for the
new linear segment to propose the next time using Equation \ref{eq:linear_time}.
This will continue until the proposed event time is accepted. This process is depicted visually in Figure \ref{fig:envelope}
and summarised in Algorithm \ref{alg:ipp}.
%
%
%
\par
%
%
Within this work, we limit ourselves to models where the envelope provided
by $h(t)$ will only be an approximate upper bound, meaning bias will likely be
introduced during inference. This can be diagnosed
through the acceptance ratio $\lambda(t) / h(t)$; if this value is greater than one,
the condition of $h(t)$ being a local upper bound is violated. The amount of
potential bias introduced can be mitigated by increasing the scaling factor $\alpha$ in Equation \ref{eq:rate_adjust} at the
expense of increased computational load. This property is investigated in Supp.
Material A. In the following sections, we evaluate the proposed event thinning
scheme for \Glspl{bnn} to identify the suitability of different samplers for inference
in these challenging models, and how they can outperform other stochastic approximation
methods in terms of calibration, posterior exploration, sampling efficiency and
predictive performance.
%
%
\begin{algorithm}[!h]
  \SetAlgoLined \KwResult{Proposed PDMP Event Time $\tau$}
  Initialize $T, L$\;
  Evaluate $(0, \lambda(0)), (t_{init}, \lambda(t_{init}))$\;
  $i = 1$\;
  Compute $a_{i}, b_{i}$\;
  $T_{0} \leftarrow 0, L_{0} \leftarrow \lambda(0)$\;
  Discard $t_{init}, \lambda(t_{init}) $\;
  \While{not accepted}{
    \tcp{propose event time with $t_{i-1}, a_{i}$ and $b_{i}$}
    $t_{i} \sim \text{PP}\big(h(t)\big)$\;
    $u \sim \text{Uniform}[0, 1]$\;
    \If{$u \leq \lambda(t_{i}) / h(t_{i})$}{
      \tcp{sample is accepted}
      $\tau = t_{i}$\;
      accepted = True\;
    }
    \Else{
      \tcp{increment counter}
      $i \leftarrow i + 1$\;
      $T_{i} \leftarrow t_{i}, L_{i} \leftarrow \lambda(t_{i})$\;
      \tcp{update linear segment}
      $a_{i}, b_{i}$ = update($L, T$)\;
    }
  }
  \caption{Sampling event times using the proposed adaptive thinning method.}
  \label{alg:ipp}
\end{algorithm}
%
%
%%%%% RELATED
\section{Related Work}
\label{sec:related}
%
%
%
The samplers used within this work require the use of an additional reference
process to provide velocity refreshments. The Generalised BPS
\cite{wu2017generalized} is an updated variant of the BPS algorithm that
incorporates a stochastic update of the velocity, which alleviates the
need for a refreshment process. Simulations have shown comparable performance
to the BPS for simple models and that it can reduce the need for fine-tuning the reference
parameter $\tau_{ref}$.
%
%
\par
%
%
Another prominent sampler is the Zig-Zag Process (ZZP)
\cite[]{bierkens2019zig}, where at events the dynamics of a single
parameter are updated. For the one-dimensional case, this sampler
represents the same process as the BPS. This sampler has shown
favourable results in terms of mixing performance and can achieve
ergodicity for certain models where the BPS cannot. A key characteristic
of this method is that each parameter is assigned an individual event
rate, making implementation for high-dimensional \Gls{bnn} models challenging.
%
%
\par
%
%
Another class of algorithms designed for subsampling are discrete
stochastic MCMC methods \cite{wenzel2020good, chen2014stochastic, chen2015complete,
  welling2011bayesian, li2015preconditioned}. These models have shown
favourable performance, with a recent variant achieving comparable
predictive accuracy on the ImageNet data set
\cite{heek2019bayesian}. In contrast to algorithms based on PDMPs, it
has been shown that the high variance associated with naive subsampling limits
these methods to providing only an approximation to the posterior
\cite{betancourt2015fundamental}. The bias that is introduced due to subsampling can be controlled by
reducing the step-size for these methods at the expense of mixing performance
and posterior exploration~\cite{nagapetyan2017true,brosse2018promises,teh2016consistency}.
We investigate the effect of this property for \Gls{sgld} and compare performance with \Gls{pdmp}
samplers in the following section.
%
%
%
%%%%%  EXPERIMENTS
\section{Experiments}
\label{sec:experiments}
%
%
We now validate the performance of \Glspl{pdmp} using the proposed event
sampling method on a number of synthetic and real-world data sets for regression
and classification. To analyse performance for predictive tasks, the predictive
posterior needs to be evaluated. In this work, we discretise samples from the
trajectory to allow for Monte Carlo integration,
\begin{multline}
  \label{eq:discretise}
  p(y^* | x^*, \mathcal{D}) = \int \pi(\mathbf{\omega}) p(y^* | \mathbf{\omega}, x^*) d\mathbf{\omega} \\
  \approx \dfrac{1}{N} \sum_{i=1}^N p(y^* | \mathbf{\omega}_i, x^*)
  \hspace*{0.5cm} \mathbf{\omega}_i \sim \pi(\mathbf{\omega}),
\end{multline}
%
%
where the parameter samples $\mathbf{\omega}_i$ are taken from the values
encountered at event times. Experimentation is first conducted on synthetic data
sets to allow us to easily visualise predictive performance and uncertainty in
our models, followed by more difficult classification tasks with Bayesian
\Glspl{cnn} on real data sets. For all experimentation, we set our scaling
factor from Equation \ref{eq:rate_adjust} to $\alpha=1.0$ to promote computational
efficiency. To enable these experiments, we deliver a Python package titled
Tensorflow PDMP (TPDMP). This package utilises the TensorFlow Probability
library \cite{dillon2017tensorflow}, allowing for hardware acceleration and
graph construction of all our models to accelerate computation. We deliver
kernels to implement the \Gls{bps}, \sigbps, and Boomerang sampler with our
proposed event thinning scheme. Code is available
at
\href{https://github.com/egstatsml/tpdmp.git}{https://github.com/egstatsml/tpdmp.git}.
%
%
\begin{figure*}[!hbt]
  % \centering
  \begin{subfigure}[t]{0.18\textwidth}
    \centering
    \includegraphics[width=1\linewidth]{./figs/regression/toy_b_bps/pred.pdf}
    \caption{BPS}
  \end{subfigure}%
  ~
  \begin{subfigure}[t]{0.18\textwidth}
    \centering
    \includegraphics[width=1\linewidth]{./figs/regression/toy_b_cov_pbps/pred.pdf}
    \caption{\sigbps}
  \end{subfigure}
  ~
  \begin{subfigure}[t]{0.18\textwidth}
    \centering
    \includegraphics[width=1\linewidth]{./figs/regression/toy_b_boomerang/pred.pdf}
    \caption{Boomerang}
  \end{subfigure}%
  ~
  \begin{subfigure}[t]{0.18\textwidth}
    \centering
    \includegraphics[width=1\linewidth]{./figs/regression/toy_b_sgld/pred.pdf}
    \caption{SGLD}
  \end{subfigure}
  ~
  \begin{subfigure}[t]{0.18\textwidth}
    \centering
    \includegraphics[width=1\linewidth]{./figs/regression/toy_b_sgld_no_decay/pred.pdf}
    \caption{SGLD-ND}
  \end{subfigure}
  \caption{Examples of the different PDMP samplers using the proposed event
    thinning procedure on a synthetic regression task, compared against SGLD with
    a decaying learning rate and with a constant learning rate (SGLD-ND).}
  \label{fig:regression}
\end{figure*}
%
%
\subsection{Regression and Binary Classification with Synthetic Data}
\label{sec:reg_bin}
%
%
%
%
To visualise predictive performance and uncertainty estimation, regression and
binary classification tasks are formed on synthetic data sets. The networks used
for these tasks are described in Supp. Material E. Before sampling, a MAP
estimate was found using stochastic optimisation and used to initialise every
sampler. Each sampling method then generated 2,000 samples. The
\sigbps requires an additional warmup period to identify suitable values for the
preconditioner. We achieve this by drawing 1,000 initial samples using the
BPS and estimating the standard deviation parameters of the preconditioner
from these samples using the Welford algorithm \cite{welford1962note}. These
preconditioner values are then fixed throughout the sampling process. For the
Boomerang Sampler, the preconditioner scaling factor from Equation
\ref{eq:boomerang_sigma} is set to $\gamma = 500.0$. The \Gls{pdmp} methods are
compared against \Gls{sgld} which starts with a learning rate that decays to
zero as required \cite{welling2011bayesian, nagapetyan2017true}, and with no
decay of the learning rate as is commonly done in practice (SGLD-ND). Examples
of the predictive posterior distribution for regression and binary
classification are shown in Figures \ref{fig:regression} and \ref{fig:logistic}
respectively, with full analysis in Supp. Material B. All \Gls{pdmp} models are
fit with the proposed adaptive event thinning procedure.
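\par
For reference, the running estimates produced by the Welford algorithm can be
sketched as follows. The symbols $\bar{\omega}_n$, $M_n$, and
$\hat{\sigma}_n^2$ are notation introduced here for illustration, and the update
is assumed to be applied independently to each parameter over the warmup samples
$\omega_1, \dots, \omega_n$:
\begin{align*}
  \bar{\omega}_n   & = \bar{\omega}_{n-1} + \frac{\omega_n - \bar{\omega}_{n-1}}{n}, \\
  M_n              & = M_{n-1} + (\omega_n - \bar{\omega}_{n-1})(\omega_n - \bar{\omega}_n), \\
  \hat{\sigma}_n^2 & = \frac{M_n}{n - 1} \quad \text{for } n \geq 2,
\end{align*}
with $\bar{\omega}_0 = 0$ and $M_0 = 0$. Only the running mean and the running
sum of squared deviations need to be stored, so the warmup samples themselves
need not be retained.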
%
%
\par
%
%
Results from these experiments affirm that inference from the PDMP models is
suitable for predictive reasoning, with low variance seen within the range of
observed data and greater variance as distance from observed samples increases.
We similarly see an increase in uncertainty along the decision boundary, which
is a desirable property. This is in contrast to \Gls{sgld}, which is unable to
offer suitable predictive uncertainty, even with larger
non-decreasing learning rates. This highlights the known limitations of SGLD:
with a decaying learning rate it can fail to explore the posterior, and
with a larger non-decreasing learning rate it converges to the dynamics of
traditional SGD \cite{brosse2018promises,nagapetyan2017true}.
%
%
\par
%
%
These tests indicate promising performance in terms of predictive accuracy and
uncertainty estimates. To further demonstrate classification performance, we
move to larger and more complicated models for performing classification on
real-world data sets.
%
%
\begin{figure}[!h]
  % \centering
  \begin{subfigure}[t]{0.25\textwidth}
    \centering
    \includegraphics[width=1\linewidth]{./figs/moons/grid_mean_logistic_bps.pdf}
  \end{subfigure}%
  % ~
  \begin{subfigure}[t]{0.25\textwidth}
    \centering
    \includegraphics[width=1\linewidth]{./figs/moons/grid_mean_logistic_sgld.pdf}
  \end{subfigure}
  \\
  \begin{subfigure}[t]{0.25\textwidth}
    \centering
    \includegraphics[width=1\linewidth]{./figs/moons/grid_var_logistic_bps.pdf}
    \caption{BPS}
  \end{subfigure}%
  % ~
  \begin{subfigure}[t]{0.25\textwidth}
    \centering
    \includegraphics[width=1\linewidth]{./figs/moons/grid_var_logistic_sgld.pdf}
    \caption{SGLD}
  \end{subfigure}
  \caption{Examples of the predictive mean and variance for the synthetic
    classification task. The left column shows results using the BPS and the
    right column shows results using SGLD. We see increased uncertainty for the
    BPS around the decision boundary, whilst SGLD shows greater certainty.}
  \label{fig:logistic}
\end{figure}
%
%
\subsubsection{UCI-Datasets}
\label{sec:uci}
We further evaluate the performance of the PDMP samplers enabled by the proposed
event sampling scheme on datasets from the UCI repository \cite{ucidatasets}. In
Table \ref{tab:uci}, we show performance metrics on the Boston Housing dataset,
with the Naval, Energy, Yacht, and Concrete datasets evaluated in Supp. Material
E.2. Each model is fit with 2,000 samples. For these experiments, we further
include the naive \Gls{sghmc} \cite{chen2014stochastic}. Predictive performance
of these models is measured with Mean Squared Error (MSE) and Negative
Log-Likelihood (NLL). Sampling efficiency is evaluated with Effective Sample
Size (ESS) \cite{robert1999monte}. Due to the high dimension of our models, we
perform PCA on returned samples and project them onto the first principal
component to report ESS on the direction of greatest variance within samples.
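\par
As a reminder of what this quantity measures, a common estimator of the ESS for
a scalar chain of $N$ samples, here the projections onto the first principal
component, is
\begin{equation*}
  \mathrm{ESS} = \frac{N}{1 + 2 \sum_{k=1}^{K} \hat{\rho}_k},
\end{equation*}
where $\hat{\rho}_k$ is the sample autocorrelation at lag $k$ and $K$ is a
truncation lag. This form and its notation are given for illustration only; the
exact estimator and truncation rule behind the reported values are not specified
here and may differ. Strong positive autocorrelation inflates the denominator,
so a highly correlated chain yields an ESS far below $N$.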
% ----
\par
% ----
From these results, we see that \Gls{sgld} and \Gls{sghmc} provide a slight
improvement in terms of MSE and NLL, whereas the Boomerang Sampler
considerably outperforms both of these methods in terms of sampling efficiency.
This result is consistent with the previous section, where SGLD frequently
converges to the SGD solution space, whilst the PDMP samplers can explore the
posterior space. Additional results in Supp. Material E.2 support these
findings.
%
%
%
\begin{table}[t!]
  \caption{Summary of predictive performance using PDMP samplers with the
    proposed event time sampling methods on the Boston Housing dataset. Negative
    log-likelihood (NLL) and Mean Squared Error (MSE) are reported. Effective sample size (ESS) is measured over the first
    principal component of samples. Mean and standard deviation are reported
    over 5 independent runs.}
  \label{tab:uci}
  \begin{center}
    \begin{small}
      % \begin{sc}
      \scalebox{0.85}{
        %
        \begin{tabular}{l l l l l}
          \toprule
          {\bfseries Inference} & {\bfseries NLL $\downarrow$}
                                & {\bfseries MSE $\downarrow$}
                                & {\bfseries ESS $\uparrow$}
          \\
          \midrule\midrule[.1em]
          BPS                   & 51.26 $\pm$ 0.19             & 3.81 $\pm$ 0.08          & 2.73 $\pm$ 0.03
          \\
          \sigbps               & 51.14 $\pm$ 0.10             & 3.76 $\pm$ 0.05          & 2.74 $\pm$ 0.05

          \\
          Boomerang             & 51.40 $\pm$ 0.32             & 3.87 $\pm$ 0.14          & \textbf{1974.73 $\pm$ 34.83}
          \\
          SGLD-ND               & \textbf{51.07 $\pm$ 0.00}    & \textbf{3.73 $\pm$ 0.00} & 2.87 $\pm$ 0.00
          \\
          SGHMC                 & 51.08 $\pm$ 0.08             & 3.74 $\pm$ 0.03          & 2.72 $\pm$ 0.01
          \\
          \bottomrule
        \end{tabular}
      }
    \end{small}
  \end{center}
\end{table}

\subsection{Multi-Class Classification}
\label{sec:classification}
%
%
We now evaluate the performance of the proposed sampling procedures on the
popular MNIST \cite{lecun1998gradient}, Fashion MNIST \cite{xiao2017fashion},
SVHN \cite{netzer2011reading}, CIFAR-10 and CIFAR-100
\cite{krizhevsky2009learning} data sets using \Glspl{cnn}. For MNIST and
Fashion-MNIST, the LeNet5 architecture was used whilst for SVHN, CIFAR-10, and
CIFAR-100 the modified ResNet20 architecture from \cite{wenzel2020good} was
used. Each parameter was again assigned a standard normal prior.
%
%
\begin{table}[t!]
  \caption{Summary of predictive performance using PDMP samplers with the
    proposed event time sampling methods. Accuracy (ACC) and negative log-likelihood (NLL) are reported, along with calibration measured using the expected
    calibration error (ECE) \cite{guo2017calibration}. Effective sample size (ESS) is measured over the first
    principal component of samples. Mean and standard deviation are
    reported over 5 independent runs.}
  \label{tab:conv}
  \begin{center}
    \begin{smaller}
      % \begin{sc}
      \scalebox{0.85}{
        %
        \begin{tabular}{l l l l l}
          \toprule
          {\bfseries Inference}    & {\bfseries ACC $\uparrow$}
                                   & {\bfseries NLL $\downarrow$}
                                   & {\bfseries ECE $\downarrow$}
                                   & {\bfseries ESS $\uparrow$}
          \\
          \midrule\midrule[.1em]
          \multicolumn{5}{c}{MNIST}
          \\ \midrule
          % type ipp p lam warm iter acc NLL ece tim
          BPS                      & \textbf{0.99 $\pm$ 0.01}     & 62.63 $\pm$ 5.60         & 1.05 $\pm$ 0.12          & 2.70 $\pm$ 0.03             \\
          \sigbps                  & 0.97 $\pm$ 0.03              & 51.72 $\pm$ 10.85        & 0.88 $\pm$ 0.41          & 2.74 $\pm$ 0.04             \\
          Boomerang                & \textbf{0.99 $\pm$ 0.00}     & 0.18 $\pm$ 0.05          & \textbf{0.02 $\pm$ 0.00} & \textbf{138.17 $\pm$ 47.08} \\
          SGLD                     & \textbf{0.99 $\pm$ 0.00}     & 6.10 $\pm$ 0.00          & 0.09 $\pm$ 0.00          & 19.88 $\pm$ 0.02            \\
          SGLD-ND                  & \textbf{0.99 $\pm$ 0.00 }    & 77.05 $\pm$ 0.00         & 1.52 $\pm$ 0.00          & 3.40 $\pm$ 0.00             \\
          SGHMC                    & \textbf{0.99 $\pm$ 0.00}     & \textbf{0.14 $\pm$ 0.02} & \textbf{0.02 $\pm$ 0.00} & 2.71 $\pm$ 0.00             \\
          \midrule
          \multicolumn{5}{c}{Fashion-MNIST}                                                                                                           \\
          \midrule
          BPS                      & \textbf{0.91 $\pm$ 0.00}     & 16.79 $\pm$ 1.65         & 0.44 $\pm$ 0.02          & 2.74 $\pm$ 0.02             \\
          \sigbps                  & 0.90 $\pm$ 0.00              & \textbf{3.43 $\pm$ 1.00} & 0.32 $\pm$ 0.02          & 2.79 $\pm$ 0.12             \\
          Boomerang                & \textbf{0.91 $\pm$ 0.00}     & 3.82 $\pm$ 0.29          & \textbf{0.31 $\pm$ 0.01} & \textbf{200.00 $\pm$ 0.00}  \\
          SGLD                     & \textbf{0.91 $\pm$ 0.00}     & 5.53 $\pm$ 0.00          & 0.30 $\pm$ 0.00          & 19.85 $\pm$ 0.02            \\
          SGLD-ND                  & \textbf{0.91 $\pm$ 0.00}     & 69.17 $\pm$ 0.01         & 1.58 $\pm$ 0.00          & 3.59 $\pm$ 0.00             \\
          SGHMC                    & \textbf{0.91 $\pm$ 0.00}     & 4.63 $\pm$ 0.13          & 0.34 $\pm$ 0.00          & 2.71 $\pm$ 0.00             \\
          \midrule
          \multicolumn{5}{c}{SVHN}                                                                                                                    \\
          \midrule
          BPS                      & 0.95 $\pm$ 0.00              & 35.35 $\pm$ 5.91         & 0.61 $\pm$ 0.10          & 2.69 $\pm$ 0.02             \\
          \sigbps                  & 0.95 $\pm$ 0.00              & \textbf{0.36 $\pm$ 0.11} & \textbf{0.19 $\pm$ 0.00} & 2.74 $\pm$ 0.05             \\
          Boomerang                & 0.95 $\pm$ 0.00              & 0.50 $\pm$ 0.07          & \textbf{0.19 $\pm$ 0.00} & \textbf{186.33 $\pm$ 21.10} \\
          SGLD                     & 0.95 $\pm$ 0.00              & 7.01 $\pm$ 10.12         & 0.24 $\pm$ 0.10          & 16.61 $\pm$ 6.44            \\
          SGLD-ND                  & \textbf{0.96 $\pm$ 0.00}     & 27.32 $\pm$ 0.08         & 0.44 $\pm$ 0.00          & 3.73 $\pm$ 0.00             \\
          SGHMC                    & 0.95 $\pm$ 0.00              & 0.47 $\pm$ 0.05          & \textbf{0.19 $\pm$ 0.00} & 2.71 $\pm$ 0.00             \\
          \midrule
          \multicolumn{5}{c}{CIFAR-10}                                                                                                                \\
          \midrule
          BPS                      & 0.79 $\pm$ 0.01              & 42.03 $\pm$ 2.42         & 1.18 $\pm$ 0.07          & 2.82 $\pm$ 0.10             \\
          \sigbps                  & 0.79 $\pm$ 0.00              & \textbf{5.93 $\pm$ 6.43} & 0.70 $\pm$ 0.05          & 2.75 $\pm$ 0.08             \\
          Boomerang                & 0.81 $\pm$ 0.00              & 6.71 $\pm$ 2.20          & \textbf{0.64 $\pm$ 0.06} & \textbf{200.00 $\pm$ 0.00}  \\
          SGLD                     & 0.81 $\pm$ 0.00              & 13.31 $\pm$ 0.01         & 0.85 $\pm$ 0.00          & 19.83 $\pm$ 0.00            \\
          SGLD-ND                  & \textbf{0.82 $\pm$ 0.00}     & 31.12 $\pm$ 0.13         & 0.89 $\pm$ 0.00          & 4.04 $\pm$ 0.00             \\
          SGHMC                    & 0.80 $\pm$ 0.00              & 13.84 $\pm$ 0.27         & 0.92 $\pm$ 0.02          & 2.71 $\pm$ 0.00             \\
          \midrule
          \multicolumn{5}{c}{CIFAR-100}                                                                                                               \\
          \midrule
          BPS                      & 0.57 $\pm$ 0.01              & 42.45 $\pm$ 1.63         & 2.48 $\pm$ 0.09          & 2.69 $\pm$ 0.02             \\
          \sigbps                  & 0.63 $\pm$ 0.00              & 8.27 $\pm$ 0.37          & 1.39 $\pm$ 0.00          & 2.78 $\pm$ 0.08             \\
          Boomerang                & \textbf{0.64 $\pm$ 0.00}     & \textbf{6.85 $\pm$ 0.86} & \textbf{1.35 $\pm$ 0.01} & \textbf{162.21 $\pm$ 43.74} \\
          SGLD                     & \textbf{0.64 $\pm$ 0.00}     & 12.40 $\pm$ 0.07         & 1.45 $\pm$ 0.00          & 20.40 $\pm$ 0.07            \\
          SGLD-ND                  & \textbf{0.64 $\pm$ 0.00}     & 11.10 $\pm$ 0.05         & 1.42 $\pm$ 0.00          & 2.83 $\pm$ 0.00             \\
          SGHMC                    & \textbf{0.64 $\pm$ 0.00}     & 12.34 $\pm$ 0.10         & 1.45 $\pm$ 0.00          & 2.71 $\pm$ 0.00             \\
          \bottomrule
        \end{tabular}
      }
      % \end{sc}
    \end{smaller}
  \end{center}
\end{table}
%

%
%
\par
%
%
Similar to the experiments on regression, a MAP estimate is found and used to
initialise each sampler. We then generate 2,000 samples for each model and
apply a thinning factor
of 10, reducing the number of returned samples used for prediction to 200.
For these models, we measure predictive performance and calibration
through the Accuracy, NLL, and Expected Calibration Error (ECE)
\cite{guo2017calibration}, and similarly measure sampling efficiency using the
ESS of the samples projected onto the first principal component. A full
description of the models used and the experiment parameters is given in Supp.
Material E.3. Table \ref{tab:conv} summarises the results of these experiments.
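\par
For reference, the ECE of \cite{guo2017calibration} partitions the $n$ test
predictions into $M$ equally spaced confidence bins $B_1, \dots, B_M$ and
compares the accuracy with the mean confidence inside each bin,
\begin{equation*}
  \mathrm{ECE} = \sum_{m=1}^{M} \frac{|B_m|}{n}
  \left| \mathrm{acc}(B_m) - \mathrm{conf}(B_m) \right|,
\end{equation*}
where $\mathrm{acc}(B_m)$ is the fraction of correctly classified samples in bin
$B_m$ and $\mathrm{conf}(B_m)$ is the mean predicted probability of the
predicted class in that bin. The symbols are introduced here for illustration,
and the number of bins $M$ is left unspecified as it is not stated in this
section. A perfectly calibrated model has equal accuracy and confidence in every
bin, giving an ECE of zero.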
%
%
\begin{figure}[!t]
  \centering
  \begin{subfigure}[t]{0.18\textwidth}
    \centering \includegraphics[width=\linewidth]{./figs/autocorrelation/svhn_body/autocorr_most.pdf}
    \caption{ACF first principal component}
  \end{subfigure}%
  ~
  \begin{subfigure}[t]{0.18\textwidth}
    \centering \includegraphics[width=\linewidth]{./figs/autocorrelation/svhn_body/trace_most.pdf}
    \caption{Trace first principal component}
  \end{subfigure}
  \caption{Example of ACF and trace plots for the first principal
    component of the samples from the network fit on the SVHN dataset.}
  \label{fig:autocorr}
\end{figure}
%
%
\begin{figure}[h]
  \centering
  \begin{subfigure}[t]{0.13\textwidth}
    \centering \includegraphics[width=\linewidth]{./figs/svhn_compare/0_27_image.png}
    \caption{Image Class 0}
  \end{subfigure}%
  \\
  \begin{subfigure}[t]{0.20\textwidth}
    \centering \includegraphics[width=\linewidth]{./figs/svhn_compare/boom.pdf}
    \caption{Boomerang Sampler}
  \end{subfigure}%
  ~
  \begin{subfigure}[t]{0.20\textwidth}
    \centering \includegraphics[width=\linewidth]{./figs/svhn_compare/sgld.pdf}
    \caption{SGLD}
  \end{subfigure}%
  \caption{Examples from the predictive posterior for difficult-to-classify samples from SVHN. The top row shows the original image and the bottom row shows the predictive distribution for the Boomerang sampler and SGLD. The mean for each class is represented by a dot, and the 95\% credible intervals are shown with error bars.}
  \label{fig:class_pred}
\end{figure}
%
%
%
%
\par
%
%
These results highlight favourable performance for certain samplers. The BPS
is unable to provide calibrated predictions, whilst the \sigbps and
Boomerang samplers consistently provide better-calibrated predictions
and a lower NLL than the BPS. Most importantly, we note that the Boomerang
sampler consistently outperforms the other samplers in terms of effective
sample size, whilst also achieving competitive or improved predictive accuracy.
This highlights the potential of the Boomerang sampler for probabilistic inference within neural networks.
%
%
\par
%
%
With measures of predictive performance and ESS in hand, we further investigate the mixing properties of the samplers
to identify how well the posterior space is being explored. The ESS only approximates the number of independent samples within an MCMC chain, whereas we are also interested in how well the support of the posterior is being explored.
Given the large number of parameters within a BNN, it is infeasible to
evaluate the coordinate trace and autocorrelation plots for individual parameters, as is
typically done for MCMC models. Instead, we again perform PCA to reduce the
dimension of our samples and investigate the ACF and trace plots of the first principal component, as illustrated in Figure
\ref{fig:autocorr}. From these figures, we can identify strong correlation
between samples from the \Gls{bps}, \sigbps, SGHMC, and SGLD-ND solutions. \Gls{sgld} offers reduced correlation in samples; however, as seen in the trace plot, its samples fail to explore the posterior and instead converge to a steady
state, whilst the Boomerang sampler provides considerably reduced correlation and more favourable mixing.
Convergence of the \Gls{sgld} samples can be attributed to the reduction in learning rate required to target the posterior. We verify this result in Supp.
Material C, where we provide further analysis of the results from all networks and the remaining principal components. The effect of this convergence on predictive uncertainty is illustrated in Figure \ref{fig:class_pred},
where the \Gls{pdmp} sampler is able to provide more meaningful uncertainty estimates for
difficult-to-classify samples, whereas the \Gls{sgld} predictive results resemble
those of a point estimate. Additional examples of the predictive distributions are
shown in Supp. Material H.
%
%
\par
%
%
Probabilistic methods have shown favourable performance in terms of \Gls{ood}
detection \cite{grathwohl2019your, maddox2019simple}. Given the point-estimate-like nature of the results returned by \Gls{sgld}, we
compare it with the Boomerang sampler to see whether both can offer similar
performance for \Gls{ood} data. We see in Figure \ref{fig:entropy} that the
Boomerang sampler offers greater entropy for \Gls{ood} data, indicating a
desirable increase in predictive uncertainty. Additional analysis is provided in Supp. Material G. Given the consistent
predictive performance, quality of uncertainty estimates, and posterior exploration, we recommend that
researchers wishing to apply MCMC methods to \Glspl{bnn} consider the Boomerang sampler.
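\par
To make the quantity in Figure \ref{fig:entropy} concrete, and assuming the
histograms are computed from the Monte Carlo predictive distribution of Equation
\ref{eq:discretise}, the predictive entropy for an input $x^*$ over $C$ classes
can be written as
\begin{gather*}
  \mathrm{H}[y^* | x^*, \mathcal{D}] \approx - \sum_{c=1}^{C} \bar{p}_c \log \bar{p}_c, \\
  \bar{p}_c = \frac{1}{N} \sum_{i=1}^{N} p(y^* = c | \mathbf{\omega}_i, x^*).
\end{gather*}
The symbols $\bar{p}_c$ and $C$ are notation introduced here for illustration.
The entropy is zero for a fully confident prediction and attains its maximum of
$\log C$ for a uniform predictive distribution, so larger values on \Gls{ood}
inputs indicate less confident predictions.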
%
%
%
%
\begin{figure}[!t]
  \centering
  \begin{subfigure}[t]{0.20\textwidth}
    \centering \includegraphics[width=\linewidth]{./figs/in_entropy.pdf}
    \caption{In-distribution}
  \end{subfigure}%
  ~
  \begin{subfigure}[t]{0.20\textwidth}
    \centering \includegraphics[width=\linewidth]{./figs/out_entropy.pdf}
    \caption{OOD}
  \end{subfigure}
  \caption{Entropy histograms comparing SGLD and the Boomerang sampler fit on
    the CIFAR-10 dataset, with OOD data represented by SVHN. We see that the
    predictive entropy from the Boomerang sampler increases as desired for OOD
    data, whilst SGLD remains overly confident for erroneous samples.}
  \label{fig:entropy}
\end{figure}
%
%
%
%%%%%% DISCUSSION
\section{Discussion and Limitations}
\label{sec:discussion}
%
%
Whilst the \Gls{pdmp} methods have shown favourable performance in terms of predictive
accuracy, calibration and uncertainty in \Glspl{bnn}, there are certain
challenges with fitting them. The PDMP samplers used within this work are designed to target the joint
distribution,
\begin{equation}
  \label{eq:joint}
  p(\myw, \myv) = \pi(\myw) \Phi(\myv),
\end{equation}
%
where $\pi(\myw)$ is the target posterior and $\Phi(\myv)$ is the distribution of the
auxiliary velocity components which must be set by users in the form of the
refreshment distribution. For the \Gls{bps} and \sigbps samplers, it has been
shown that the reference distribution may be a Gaussian or restricted to the
unit hypersphere \cite{bouchard2018bouncy}. For the Boomerang sampler, the
velocity distribution is designed with respect to a reference measure to ensure
invariance to the target distribution, such that $\Phi(\myv) = \mathcal{N}(0, \Sigma_{\star})$,
where $\Sigma_{\star}$ is the same factor used to precondition the dynamics. The choice
in distribution used for the velocity component has an explicit effect on the
mixing capabilities of the models when applied to \Glspl{bnn}. We demonstrate
this effect in Supp. Material D.1. We find that a velocity distribution with too
much variance can cause effects similar to the divergences seen in HMC and
NUTS. Furthermore, we see that with the variance set too low, the samplers can fail
to explore the posterior sufficiently to provide the desired meaningful
uncertainty estimates. A similar effect can be seen for the choice of
refreshment rate, which we investigate in Supp. Material D.2. We highlight these
limitations as areas for future research to enable robust application of
\Gls{pdmp} methods for \Glspl{bnn}.
%
%
\par
%
%
The Boomerang sampler, as implemented within this work and the original paper, is
probabilistic, though it is not purely Bayesian. This is due to the reference
measure for the velocity being identified through the data itself. A strictly
Bayesian approach can be recovered by setting the reference measure and
associated preconditioner matrix from a prior distribution, though we would lose
some of the favourable sampling performance offered by this sampler. We can view the
approach implemented here as similar to empirical Bayes, where we glean
information about the prior (the reference measure for the Boomerang sampler) from
the data itself. Given the difficulty of specifying a meaningful and informative
prior, and the success seen when using empirical priors for \Glspl{bnn}
\cite{krishnan2020specifying}, we believe the use of such an approach for the
Boomerang sampler is justified.
%
%
%%%%%% Conclusion
\section{Conclusion}
%
%
Within this work, we demonstrate how \Glspl{pdmp} can be used for \Glspl{bnn}.
We provide a flexible piecewise linear bound to enable sampling of event times
within these frameworks that permits inference in \Glspl{bnn}. A
GPU-accelerated software package is offered to increase the availability of
PDMPs for a wide array of models. Experimentation on BNNs for regression and
classification indicates comparable or improved predictive performance and
calibration, together with consistent improvements in sampling efficiency
and uncertainty estimation, when compared against existing stochastic inference
methods.
%
%
%
%
%
%
% References
\bibliography{ref.bib}
\end{document}
