\documentclass[accepted]{uai2024} % for initial submission
%\documentclass[accepted]{uai2024} % after acceptance, for a revised version; 
% also before submission to see how the non-anonymous paper would look like 
                        
%% There is a class option to choose the math font
% \documentclass[mathfont=ptmx]{uai2024} % ptmx math instead of Computer
                                         % Modern (has noticeable issues)
% \documentclass[mathfont=newtx]{uai2024} % newtx fonts (improves upon
                                          % ptmx; less tested, no support)
% NOTE: Only keep *one* line above as appropriate, as it will be replaced
%       automatically for papers to be published. Do not make any other
%       change above this note for an accepted version.

%% Choose your variant of English; be consistent
\usepackage[american]{babel}
% \usepackage[british]{babel}

%% Some suggested packages, as needed:
\usepackage{natbib} % has a nice set of citation styles and commands
    \bibliographystyle{plainnat}
    \renewcommand{\bibsection}{\subsubsection*{References}}
\usepackage{mathtools} % amsmath with fixes and additions
% \usepackage{siunitx} % for proper typesetting of numbers and units
\usepackage{booktabs} % commands to create good-looking tables
\usepackage{tikz} % nice language for creating drawings and diagrams


\usepackage[utf8]{inputenc} % allow utf-8 input
\usepackage[T1]{fontenc}    % use 8-bit T1 fonts
% \usepackage{hyperref}       % hyperlinks
\usepackage{url}            % simple URL typesetting
\usepackage{amsfonts}       % blackboard math symbols
\usepackage{nicefrac}       % compact symbols for 1/2, etc.
\usepackage{microtype}      % microtypography
\usepackage{graphicx}
\usepackage{caption}
\usepackage{subcaption}
\usepackage{xcolor}         % colors
\usepackage{amsmath}
\usepackage{amsthm}
\newtheorem{theorem}{Theorem}
\newtheorem{corollary}{Corollary}
\usepackage{amssymb}
\usepackage{algorithm,algorithmic}
\usepackage{csquotes}
 \DeclareGraphicsExtensions{.pdf,.png}
\DeclareMathOperator*{\argmax}{arg\,max}

\theoremstyle{definition}
\newtheorem{definition}{Definition}



%% Provided macros
% \smaller: Because the class footnote size is essentially LaTeX's \small,
%           redefining \footnotesize, we provide the original \footnotesize
%           using this macro.
%           (Use only sparingly, e.g., in drawings, as it is quite small.)

%% Self-defined macros
\newcommand{\swap}[3][-]{#3#1#2} % just an example

\title{Functional Wasserstein Variational Policy Optimization}

% The standard author block has changed for UAI 2024 to provide
% more space for long author lists and allow for complex affiliations
%
% All author information is automatically removed by the class for the
% anonymous submission version of your paper, so you can already add your
% information below.
%
% Add authors
\author[1]{\href{mailto:<Junyu.Xuan@uts.edu.au>?Subject=Your UAI 2024 paper}{Junyu~Xuan}{}}
\author[1]{Mengjing~Wu}
\author[1]{Zihe~Liu}
\author[1]{Jie~Lu}
% Add affiliations after the authors
\affil[1]{%
    Australian Artificial Intelligence Institute\\
    University of Technology Sydney\\
    Ultimo NSW 2007, Australia\\
}

\begin{document}
\maketitle

\begin{abstract}
  Variational policy optimization has become increasingly attractive to the reinforcement learning community because of its strong capability in uncertainty modeling and environment generalization. However, almost all existing studies in this area rely on the Kullback–Leibler (KL) divergence, which is unfortunately ill-defined in several situations. In addition, the policy is parameterized and optimized in weight space, which may not only introduce unnecessary bias but also make policy learning harder because of the complex dependencies in the weight posterior. In this paper, we design a novel functional Wasserstein variational policy optimization (FWVPO) based on the Wasserstein distance between function distributions. Specifically, we first parameterize the policy as a Bayesian neural network from a function-space view rather than a weight-space view, and then propose FWVPO to optimize and explore the functional policy posterior. We prove that FWVPO is a valid variational Bayesian objective and that it guarantees monotonic expected reward improvement under certain conditions. Experimental results on multiple reinforcement learning tasks demonstrate the efficiency of our new algorithm in terms of both cumulative rewards and uncertainty modeling capability.
\end{abstract}

\section{INTRODUCTION}
\label{introduction}



Reinforcement learning aims to optimize a policy that yields high cumulative rewards when interacting with a given environment. One straightforward solution is to parameterize the policy, represent the cumulative reward as a function of the policy, and maximize the cumulative reward by optimizing the policy. Such a solution is named policy optimization\footnote{We only consider gradient-based policy optimization, so we use the terms policy optimization and policy gradient interchangeably in the remainder of this paper.} \citep{schulman2015trust, schulman2017proximal, huang2021bregman} or policy gradient \citep{williams1992simple,li2021softmax,drpg}. Popular algorithms include trust region policy optimization (TRPO) \citep{schulman2015trust}, proximal policy optimization (PPO) \citep{schulman2017proximal}, Bregman gradient policy optimization \citep{huang2021bregman}, and so on. Almost all of these works parameterize the policy as a deterministic deep neural network, whose capability for uncertainty modeling and environment generalization is limited \citep{furmston2010variational}. 


One way to improve uncertainty modeling and environment generalization is to parameterize the policy as a probabilistic model \citep{furmston2010variational,levine2018reinforcement,xu2018variational}. Among all possible probabilistic models, Bayesian neural networks (BNNs) \citep{blundell2015weight,foong2020expressiveness}, which place probability distributions over all weights of a neural network, are one of the most popular options because they inherit the powerful function approximation of deep neural networks. The underlying reason for this parameterization is that it casts reinforcement learning as a probabilistic inference problem, so that various approximate probabilistic inference algorithms can be used to provide additional flexibility and representation power \citep{levine2018reinforcement,zhang2020variational,zhang2018policy} and effective reasoning about uncertainty \citep{fellows2019virel, liu2017stein}. Hence, variational inference \citep{blei2017variational} has been broadly used to improve policy optimization (named variational policy optimization), where a Kullback–Leibler (KL) divergence is added to constrain the posterior distribution of the policy. One representative work is maximum entropy policy optimization (MEPO) \citep{levine2018reinforcement,liu2017stein}. 
 


Unfortunately, there is no such thing as a free lunch. Introducing BNNs and variational inference to policy optimization also brings additional difficulties: i) the widely used Gaussian priors for network parameters are not always applicable because of their possible pathological features, for example, prior samples tend to be horizontally linear for deep nets \citep{duvenaud2014avoiding, tran2020functional}; ii) the effects of the given priors on posterior inference for weights, and in turn on the resulting distributions over model outputs in function space, are unclear and hard to control owing to the complex architecture and nonlinear nature of BNNs \citep{ma2021functional, wild2022generalized2}. Both difficulties stem from the independently distributed prior and the strong, complicated dependencies in the posterior of the policy network weights. 


In this paper, instead of parameterizing the policy in weight space, we propose a functional variational policy optimization algorithm, where a policy is given a functional prior and its posterior is optimized in function space \citep{williams2006gaussian}. Although there are some recent ingenious works on functional variational inference for BNNs \citep{sun2018functional,wang2018function,ma2021functional}, they are all based on the KL divergence, which has a natural relationship with the data log-likelihood (the variational objective is a lower bound of the data log-likelihood) but is either infinite or ill-defined in several situations \citep{gray2011entropy,burt2020understanding}, such as distributions with non-overlapping supports. Moreover, the KL divergence is known to be prone to collapsing to a local mode \citep{neumann2011variational} and is hence sensitive to the initialization (please see Section~\ref{ProposedModel} for more discussion). Therefore, when the existing functional variational inference with KL divergence is directly used as the surrogate objective function for policy optimization, it could harm the monotonic improvement of each step and may lead to instability. Our basic idea, in a nutshell, is to use the Wasserstein distance \citep{arjovsky2017wasserstein,ambrogioni2018wasserstein} between the policy posterior and prior as the constraint and to use functional Wasserstein variational inference as the surrogate objective function for policy optimization. Our main contributions are summarized as follows:
\begin{itemize}
    \item We propose a new functional Wasserstein variational inference based on 1-Wasserstein distance rather than KL divergence, where the new objective is proven to be a valid and tighter (compared with KL) variational Bayesian objective.
    \item We derive a functional Wasserstein variational policy optimization (FWVPO), prove the monotonic improvement guarantee and demonstrate the improvement compared with KL divergence.
\end{itemize}


\section{BACKGROUND}
\label{background}

\subsection{Reinforcement learning}

Reinforcement learning can be formalized as a Markov decision process (MDP). An MDP is defined by a tuple $\{\mathcal{S}, \nu, \mathcal{A}, R, P\}$, where $\mathcal{S}$ is the state space, $\nu$ is the starting distribution of states, $\mathcal{A}$ is the action space, ${R}(r | s, a)$ is the reward function ${R}: \mathcal{S} \times \mathcal{A} \rightarrow \mathbb{R}$, and ${P}(s_{t+1}|s_t, a)$ is the state transition probability. A policy $\pi(a | s; \theta)$ is a distribution over actions given a state, with $\theta$ as the parameter set. When a deep neural network is used to model $\pi_\theta$, $\theta$ contains the weights of the network. With discount factor $\gamma \in (0, 1)$, the expected discounted reward under $\pi_\theta$ is defined as
\begin{equation}
\eta(\pi_\theta) = \mathbb{E}_{s_0, a_0, \ldots}\left[ \sum_{t=0}^{\infty} \gamma^t R(s_t, a_t) \right]
\label{eq:rewardobj}	
\end{equation} 
where $s_0 \sim \nu, a_0 \sim \pi(s_0), s_1 \sim P(s_0, a_0), \ldots$. The definitions of standard concepts, including state-action value function $Q_{\pi}(s, a)$, state value function $V_{\pi}(s)$, and the advantage function $A_{\pi}(s, a)$, follow the ones in TRPO \citep{schulman2015trust} and are given in the Supplementary. 
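
As a quick sanity check on (\ref{eq:rewardobj}), assume for illustration only (this is not required elsewhere in the paper) that rewards are bounded, $|R(s, a)| \leq R_{\max}$, where $R_{\max}$ is a constant introduced here. The geometric series then gives
\[
\left| \eta(\pi_\theta) \right| \leq \sum_{t=0}^{\infty} \gamma^t R_{\max} = \frac{R_{\max}}{1-\gamma},
\]
so that, for example, $\gamma = 0.99$ and $R_{\max} = 1$ yield $|\eta(\pi_\theta)| \leq 100$. The discount factor therefore sets the scale of the objective through the factor $1/(1-\gamma)$.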

\subsection{Policy optimization}

The aim of policy optimization algorithms \citep{williams1992simple} is to maximize the expected discounted reward in (\ref{eq:rewardobj}) by optimizing the policy parameters. 
One problem of the standard policy gradient is possible collapse caused by a large update step. TRPO \citep{schulman2015trust} avoids this kind of collapse by using a KL divergence between the old and new policies to restrict the update, together with a probability ratio that compensates for the difference between the trajectory-collecting (old) policy $\pi_{\theta_{\text{old}}}$ and the current policy $\pi_\theta$:
\begin{equation}
\begin{aligned}
\max_{\theta} \quad J^{\text{TRPO}}(\theta) 
- \alpha \mathcal{KL}\left[\pi_{{\theta}_{\text{old}}} \| \pi_\theta \right]
\end{aligned}
\label{trpo}
\end{equation} 
where $\alpha$ is a hyperparameter and $J^{\text{TRPO}}(\theta) = \mathbb{E}_{s_0, a_0, ...} \left[\frac{\pi(a|s; \theta)}{\pi(a|s; {\theta_{\text{old}}})} A_{\pi_{\theta_{\text{old}}}}(s, a)\right]$. 
PPO \citep{schulman2017proximal} further extends TRPO by introducing a clipped surrogate objective that keeps the deviation from the previous policy relatively small:
$J^{\text{PPO}}(\theta) = \mathbb{E}_{s_0, a_0, \ldots} \Big [
\min ( \frac{\pi(a|s; \theta)}{\pi(a|s; {\theta_{\text{old}}})} A_{\pi_{\theta_{\text{old}}}}(s, a), \text{clip} \big (
\frac{\pi(a|s; \theta)}{\pi(a|s; {\theta_{\text{old}}})}, 
\allowbreak 1-\epsilon, 1+\epsilon   \big )A_{\pi_{\theta_{\text{old}}}}(s, a)  ) \Big]$, where $\epsilon$ is a hyperparameter. Variational policy optimization (like MEPO) \citep{levine2018reinforcement, liu2017stein} introduces a policy prior distribution $p_0(\theta)$ and optimizes the approximate policy posterior distribution $q(\theta)$ by  
$\max_{q} \mathbb{E}_{q(\theta)} \left[ J(\theta) \right] -  \alpha \mathcal{KL}[q(\theta) \| p_0(\theta)]$, where $\alpha$ is a hyperparameter and $J(\theta)$ can be any surrogate term, like $J^{\text{TRPO}}(\theta)$ or $J^{\text{PPO}}(\theta)$. The optimal posterior can be directly deduced as $q^*(\theta) \propto \exp \left ( J(\theta)/\alpha \right) p_0(\theta)$. 
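
The form of this optimum can be checked with a short derivation, sketched here under the assumptions that $q$ and $p_0$ admit densities and that $\alpha > 0$; the multiplier $\lambda$ is introduced only to enforce the normalization $\int q(\theta)\,\mathrm{d}\theta = 1$. Setting the functional derivative of the Lagrangian with respect to $q(\theta)$ to zero gives
\begin{align*}
0 &= J(\theta) - \alpha \left( \log q(\theta) - \log p_0(\theta) + 1 \right) + \lambda \\
\Rightarrow \quad q^*(\theta) &= \frac{p_0(\theta) \exp \left( J(\theta)/\alpha \right)}{\int p_0(\theta') \exp \left( J(\theta')/\alpha \right) \mathrm{d}\theta'} .
\end{align*}
For $\alpha = 1$, or when the factor $1/\alpha$ is absorbed into $J$, this is $q^*(\theta) \propto \exp(J(\theta))\, p_0(\theta)$. A larger $\alpha$ keeps $q^*$ closer to the prior, while $\alpha \to 0$ concentrates $q^*$ on the maximizers of $J$ within the support of $p_0$.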


\subsection{Bayesian neural networks}

A Bayesian neural network (BNN) \citep{blundell2015weight,foong2020expressiveness} is a neural network whose weights are given (normally independent) prior distributions. 
Given a dataset $\mathcal{D} = \{x_i, y_i\}$ where $x_i \in \mathbb{R}^d$ and $y_i \in \mathbb{R}$, a simple one-hidden-layer example is given as $y(x) = \theta^{(0)} + \theta^{(1)} \sigma \left(\theta^{(2)} + \theta^{(3)} x \right)$, 
where $\sigma$ is a nonlinear activation function, and $\theta = \{\theta^{(0)}, \theta^{(1)}, \theta^{(2)}, \theta^{(3)}\}$ are neural network weights with prior distribution $p_0(\theta)$, such as i.i.d. Gaussian distributions. There are various approximate inference methods for their complex posterior distributions, such as variational inference (VI) \citep{blundell2015weight} and Hamiltonian Monte Carlo \citep{cobb2021scaling}. Here, we briefly introduce mean-field VI for BNNs, which posits simple (e.g., Gaussian) independent variational distributions $q(\theta; \vartheta)$ with
$\theta \sim \text{Gaussian}(\bar{\mu}, \bar{\rho}) $,
where $\vartheta = \{ \bar{\mu}, \bar{\rho}\}$ denotes the variational parameters, which are trained so that $q(\theta; \vartheta)$ closely approximates the true posterior distribution. For data $\mathcal{D}$, the training objective is
\begin{equation}\label{eq:bnninf} 
\max_\vartheta  \mathbb{E}_{q(\theta;\vartheta)}\left[\sum_{i} \log p( y_i | x_i; \theta) \right] - \alpha \mathcal{KL}\left[q(\theta;\vartheta) \| p_0(\theta)\right]
\end{equation}
where $p( y_i | x_i; \theta) $ is the data likelihood, which could be a categorical distribution for classification or a Gaussian distribution for regression. The first term is the expected log-likelihood, whose variance can be further reduced by the local reparameterization trick \citep{kingma2015variational}. 
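
To make the KL term in (\ref{eq:bnninf}) concrete, consider the following illustrative special case (an assumption made only for this example): the prior is i.i.d. $\text{Gaussian}(0, \sigma_0^2)$ over the $K$ scalar weights, and $\bar{\rho}_k$ is read as the standard deviation of the $k$-th variational factor. The symbols $K$ and $\sigma_0$ are introduced here. The KL divergence then has the closed form
\[
\mathcal{KL}\left[q(\theta;\vartheta) \| p_0(\theta)\right] = \sum_{k=1}^{K} \left[ \log \frac{\sigma_0}{\bar{\rho}_k} + \frac{\bar{\rho}_k^2 + \bar{\mu}_k^2}{2 \sigma_0^2} - \frac{1}{2} \right],
\]
and the expected log-likelihood can be estimated with reparameterized samples $\theta = \bar{\mu} + \bar{\rho} \odot \varepsilon$ with $\varepsilon \sim \text{Gaussian}(0, I)$. Each summand vanishes exactly when $\bar{\mu}_k = 0$ and $\bar{\rho}_k = \sigma_0$, i.e., when the variational factor equals the prior.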




\section{FUNCTIONAL WASSERSTEIN VARIATIONAL POLICY OPTIMIZATION}
\label{ProposedModel}


The policy is traditionally parameterized by a deterministic deep neural network in which the final layer outputs parameters of an (action) distribution $\varpi(a|s; \theta)$. 
Although an action distribution is learned, such a design has limited capability to capture the uncertainty of this distribution because of the deterministic structure of the neural network. 
To resolve this issue, BNNs have been used to replace the deterministic deep neural network \citep{levine2018reinforcement,liu2017stein}, i.e., $\pi(a|s)=\mathbb{E}_{p(\theta; \vartheta)}\left[ \varpi(a|s; \theta)\right]$. 
However, existing works use BNNs only in weight space, which greatly limits the flexibility of the represented functions (more details on the difference between the weight-space and function-space views can be found in \citet{williams2006gaussian}). Hence, we use a BNN as the policy representation but work in function space rather than weight space, i.e., $\pi(a|s)=\mathbb{E}_{p(f)}\left[ \varpi(a|f(s))\right]=\mathbb{E}_{p(\theta^f; \vartheta^f)}\left[ \varpi(a|s; \theta^f)\right]$, where $p(f)$ is a functional distribution induced by a parameterized BNN with $p(\theta^f ; \vartheta^f)$. In a nutshell, $p(f)$ can be understood as a BNN whose weights follow a distribution parameterized by $\vartheta^f$. More details about the differences between a deterministic policy, a policy parameterized by a BNN in weight space, and a policy parameterized by a BNN in function space are given in the Supplementary. 

Inspired by the existing functional BNNs \citep{sun2018functional,wang2018function,ma2021functional}, we have the following initial functional variational policy optimization (FVPO), 
\begin{equation}
\begin{aligned}
\max_{q} \quad& \mathbb{E}_{q(f)}\left[ J(f)  \right] 
 -  \alpha \mathcal{KL} \left [q(f) \| p_0(f) \right]
\end{aligned}
\label{fvpo}
\end{equation} 
where $f$ is a policy function (a mapping from states to actions); $p_0(f)$ is a functional prior, such as a Gaussian process \citep{williams2006gaussian}; similar to $q(\theta ; \vartheta)$ in (\ref{eq:bnninf}), $q(f)$ is an approximate functional posterior induced by a parameterized BNN with $q(\theta^f ; \vartheta^f)$; and $J(f)$ is the surrogate term, which can be evaluated as $\mathbb{E}_{q(f)}\left[ J(f) \right] = \mathbb{E}_{q(\theta^f ; \vartheta^f)}\left[ J(\theta^f)  \right]$, where $J(\theta^f)$ can be $J^{\text{TRPO}}(\theta^f)$ or $J^{\text{PPO}}(\theta^f)$. 

The first term of FVPO is standard, so we are more interested in the second term, the functional KL divergence. Before investigating it, let us first look at two properties of this functional policy optimization: 1) the optimal function posterior can be directly deduced as  
$q^{*}(f) \propto \exp \left ( J(f)/\alpha \right) p_0(f)$, although we normally do not have an explicit probability density over functions with which to express this posterior;  
and 2) optimizing the KL divergence between function distributions is hard but doable because it is linked to the marginal KL divergence on measurement sets \citep{sun2018functional,gray2011entropy}: by Theorem 1 of \citet{sun2018functional}, $\mathcal{KL}[P \| Q] = \sup_{n \in \mathbb{N}, X \in \mathcal{X}^n} \mathcal{KL}[P_X \| Q_X]$, where $P$ and $Q$ are two stochastic processes defined on the space $\mathcal{X}$ and $P_X$ and $Q_X$ are their respective marginals on $X$.
 
Although FVPO is a natural first step toward function-space variational inference, the functional KL divergence is unfortunately an ill-defined objective function in some cases. 
\begin{definition} [Functional KL divergence \citep{gray2011entropy}]
Suppose we have two measures $P$ and $Q$ on $(\Omega, \Sigma)$ and that $P$ is absolutely continuous with respect to $Q$. Then there exists a Radon--Nikodym derivative $\mathrm{d} P / \mathrm{d} Q$, and the KL divergence between them is $\mathcal{KL}[P \| Q] = \int_\Omega \log \left \{ \mathrm{d} P / \mathrm{d} Q \right \} \mathrm{d} P$.
\end{definition}

According to the above definition, $\mathcal{KL}[P \| Q]=\infty$ if $P$ is not absolutely continuous with respect to $Q$. 
Besides, \citet{burt2020understanding} also found that $\mathcal{KL}[P \| Q]=\infty$ if the network architectures of the prior and the approximate posterior are different or if the prior is a non-degenerate Gaussian process.  




\begin{figure}[!t]
     \centering
     \begin{subfigure}[t]{0.22\textwidth}
         \centering
         \includegraphics[width=\textwidth]{aistats/figure/prior.pdf}
        \caption{Initialization}
         \label{fig:prior}
     \end{subfigure}
     \begin{subfigure}[t]{0.22\textwidth}
         \centering
         \includegraphics[width=\textwidth]{aistats/figure/prior1.pdf}
        \caption{Initialization}
         \label{fig:prior1}
     \end{subfigure}
     \begin{subfigure}[t]{0.22\textwidth}
         \centering
         \includegraphics[width=\textwidth]{aistats/figure/Wasserstein.pdf}
        \caption{Wasserstein distance}
         \label{fig:wass}
     \end{subfigure}
     \begin{subfigure}[t]{0.22\textwidth}
         \centering
         \includegraphics[width=\textwidth]{aistats/figure/Wasserstein1.pdf}
        \caption{Wasserstein distance}
         \label{fig:wass1}
    \end{subfigure}
\\
     \begin{subfigure}[t]{0.22\textwidth}
         \centering
         \includegraphics[width=\textwidth]{aistats/figure/I-projection.pdf}
        \caption{KL divergence}
         \label{fig:kl}
     \end{subfigure}
     \begin{subfigure}[t]{0.22\textwidth}
         \centering
         \includegraphics[width=\textwidth]{aistats/figure/I-projection1.pdf}
        \caption{KL divergence}
         \label{fig:kl1}
     \end{subfigure}
\caption{Approximation results from different losses. The background contour field is a Gaussian mixture with three components. A single-mode Gaussian distribution is optimized to approximate this mixture distribution using the losses defined by the Wasserstein distance and the KL divergence. The initial positions of this Gaussian distribution are given in (a) and (b), and the approximation results are plotted with (red) contour lines. The detailed setting is given in the Supplementary.}
\label{fig:klw}
\end{figure}


To resolve this issue, we propose to use Wasserstein distance to replace KL divergence, and then we have the following functional Wasserstein variational policy optimization (FWVPO),  
\begin{equation}
\begin{aligned}
\max_{q} \quad& \mathbb{E}_{q(f)}\left[ J(f)  \right] 
 -  \left ( \mathcal{W}[q(f) \| p_0(f)] \right )^2
\end{aligned}
\label{fwvpo}
\end{equation} 
where $\mathcal{W}$ denotes 1-Wasserstein distance\footnote{We only use 1-Wasserstein distance throughout this paper, so $\mathcal{W}$ for the remainder of the paper denotes 1-Wasserstein distance without further notice. } 
\begin{equation}
\begin{aligned}
\mathcal{W}[q(f) \| p_0(f)] = \inf_{p(f, f')} \int c(f, f')p(f, f') \,\mathrm{d} f \,\mathrm{d} f'
\end{aligned}
\end{equation} 
and $p(f, f')$ is any joint distribution with $f \sim q(f)$ and $f' \sim p_0(f)$ as marginals and $c$ is a cost function (metric). It is worth highlighting that (\ref{fwvpo}) is a kind of generalized variational inference (or, more generally, the \textit{Rule of Three}) \citep{knoblauch2019generalized} rather than a standard Bayesian inference objective, because the loss function term $\mathbb{E}_{q(f)}\left[ J(f)  \right] $ is not a conditional distribution and the Wasserstein distance is used instead of the KL divergence. 
The Wasserstein distance is a well-defined distance measure: it is non-negative and symmetric, and it remains well-behaved in situations \citep{ambrogioni2018wasserstein, ARRAS20192341} where KL may be infinite or unbounded because the required Radon--Nikodym derivative may not exist \citep{matthews2016sparse}. Note that, different from \citet{ambrogioni2018wasserstein}, where the Wasserstein distance is defined on parameter distributions, the distance in (\ref{fwvpo}) is defined on function distributions. Different from \citet{wild2022generalized2}, where the 2-Wasserstein distance is used to define a functional variational inference, (\ref{fwvpo}) does not simplify the BNN posterior to a Gaussian process (parameterized by a neural network mean function), a simplification that limits the uncertainty representation capability. 
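
A standard one-dimensional example (not specific to this paper) illustrates the contrast. Take two point masses $P = \delta_0$ and $Q = \delta_a$ on $\mathbb{R}$ with cost $c(x, y) = |x - y|$, where the shift $a \in \mathbb{R}$ is a parameter introduced only for this illustration. Then
\[
\mathcal{KL}[P \| Q] =
\begin{cases}
0, & a = 0, \\
\infty, & a \neq 0,
\end{cases}
\qquad
\mathcal{W}[P \| Q] = |a| .
\]
For $a \neq 0$, $P$ is not absolutely continuous with respect to $Q$, so the KL divergence gives no usable signal about how far apart the two distributions are, whereas the Wasserstein distance varies continuously with $a$.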

As argued in \citet{neumann2011variational} and shown in Figure \ref{fig:klw}, KL divergence tends to concentrate on a single mode of the target distribution when the target policy distribution is multi-modal or non-concave. It may therefore collapse to a local mode and is sensitive to the initialization, whereas the Wasserstein distance can escape the local mode and find a better (higher) one, as shown in Figure \ref{fig:klw}. This is desirable for RL because we hope to search for the best policy during the update rather than collapsing to a local optimum. 




To facilitate the optimization, we use its dual form according to the Kantorovich-Rubinstein duality \citep{villani2009optimal, arjovsky2017wasserstein}, 
\begin{equation}
\begin{aligned}
\mathcal{W}[q(f) \| p_0(f)] = \max_{\|\phi\|_{L\leq1}}  \mathbb{E}_{q(f)}\left[ \phi(f)  \right] 
 -  \mathbb{E}_{p_0(f)}\left[ \phi(f)  \right] 
 \end{aligned}
 \label{wdist}
\end{equation} 
where $\|\phi\|_{L\leq1}$ denotes that $\phi$ is constrained to be a 1-Lipschitz function. This duality nicely separates the two marginal distributions, so the evaluation can be carried out with sampling-based methods. A similar idea is also used for prior matching \citep{Tran2022}, but we use it for posterior variational inference here. There are various ways \citep{tanielian2021approximating,gulrajani2017improved} to approximate a 1-Lipschitz function with a deep neural network to facilitate the search for the (sub)optimal function. Here, we use a gradient norm regularizer to encourage $\phi$ to be a 1-Lipschitz function \citep{Tran2022, gulrajani2017improved} and then search this space for the function that maximizes (\ref{wdist}). 
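
One way to write such an estimator, sketched in the style of the gradient penalty of \citet{gulrajani2017improved} and not necessarily the exact form used in our implementation, is as follows. Let $f_i \sim q(f)$ and $f'_i \sim p_0(f)$, $i = 1, \ldots, N$, be function samples evaluated on a finite set of states $S$, and let $\phi_\omega$ be a neural network critic with parameters $\omega$. The critic is trained to maximize
\begin{align*}
\hat{\mathcal{L}}(\omega) ={}& \frac{1}{N}\sum_{i=1}^{N} \phi_\omega(f_i(S)) - \frac{1}{N}\sum_{i=1}^{N} \phi_\omega(f'_i(S)) \\
& - \lambda \, \frac{1}{N}\sum_{i=1}^{N} \left( \left\| \nabla_{\hat{f}_i} \phi_\omega(\hat{f}_i) \right\|_2 - 1 \right)^2,
\end{align*}
where $\hat{f}_i = u_i f_i(S) + (1-u_i) f'_i(S)$ with $u_i \sim \text{Uniform}[0, 1]$. The sample size $N$, the penalty weight $\lambda > 0$, and the interpolation variables $u_i$ are symbols introduced here for illustration. After training, the first two terms provide the sample-based estimate of (\ref{wdist}). The penalty only encourages, and does not strictly enforce, the 1-Lipschitz constraint.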

Note that any variational Bayesian method \citep{fox2012tutorial} needs to derive a lower bound on the marginal data likelihood (also known as the evidence) as the model training objective. One natural and important question is: Is (\ref{fwvpo}) still a valid variational objective? In short, can we use ``variational'' here? We answer this question with the following result.
\begin{theorem}
\label{them:varibound}
Let $p_0(f)$ be a function prior (such as a Gaussian process) and parameterize the policy as $\pi(a|s) = \mathbb{E}_{f \sim q(f) }  \left[ \varpi (\cdot | f, s) \right]$, where $q(f)$ is a function distribution induced by BNN weight distributions. 
Given a measurement set $S$, if $-\log p_0(f^S)$ is a Lipschitz function and the probability measure $p_0$ is absolutely continuous w.r.t. $q$, then
\[ \log p(D) \geq \mathcal{L}^{\mathcal{W}} \geq \mathcal{L}^{\mathcal{KL}}  \]
where 
\begin{equation}
\begin{aligned}
\mathcal{L}^{\mathcal{W}} =& \mathbb{E}_{q(f)}\left[ J(f)  \right] 
 - \frac{\rho}{2} \left ( \mathcal{W} [q(f^S) \| p_0(f^S)] \right )^2, 
 \\
 &\text{s.t.} \quad H(q) - H(p_0) - \frac{1}{2\rho} \geq 0, \quad \rho > 0
 \end{aligned}
 \label{eq:priorc}
\end{equation}
and 
\[\mathcal{L}^{\mathcal{KL}} = \mathbb{E}_{q(f)}\left[ J(f)  \right] 
 -  \mathcal{KL} [q(f^S) \| p_0(f^S)],\]
with $p(D|f) \propto \exp(J(D, f))$ and $\log p(D) = \log \left( \int_f p(D|f) \,\mathrm{d} f \right )$.
\end{theorem}

\begin{proof}
Please see the Supplementary. 
\end{proof}

Theorem \ref{them:varibound} confirms that (\ref{fwvpo}) is a lower bound for the marginal data likelihood so it is a valid variational objective and a tighter bound compared with KL divergence.
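
To get a feel for the entropy constraint in (\ref{eq:priorc}), consider the following purely illustrative case, which is an assumption of this example and not a requirement of Theorem \ref{them:varibound}: the entropies are taken over the marginals on a measurement set of size $k$, and both marginals are Gaussian, $q(f^S) = \text{Gaussian}(m_q, \Sigma_q)$ and $p_0(f^S) = \text{Gaussian}(m_0, \Sigma_0)$. Using $H(\text{Gaussian}(m, \Sigma)) = \frac{1}{2} \log \det (2 \pi e \Sigma)$, the constraint becomes
\[
H(q) - H(p_0) = \frac{1}{2} \log \frac{\det \Sigma_q}{\det \Sigma_0} \geq \frac{1}{2\rho}
\quad \Longleftrightarrow \quad
\log \frac{\det \Sigma_q}{\det \Sigma_0} \geq \frac{1}{\rho} .
\]
In the scalar case $k = 1$ with standard deviations $\sigma_q$ and $\sigma_0$, this reads $\sigma_q \geq \sigma_0 \exp\left(1/(2\rho)\right)$. The means do not enter, and a larger $\rho$ relaxes the condition toward $H(q) \geq H(p_0)$.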

The other question is: Does (\ref{fwvpo}) still enjoy a monotonic improvement guarantee like TRPO, and is there any improvement compared with KL divergence? We answer this question with the following result.
\begin{theorem}
\label{them:monogua}
Let an old policy (before a training step) be parameterized by a BNN with function prior $p(f)$, i.e., $\pi(a|s) = \mathbb{E}_{f \sim p(f) }  \left[ \varpi(a | s; \theta^f) \right ] $, and let a new policy be parameterized by a BNN with function prior $\tilde{p}(f)$, i.e., $\tilde{\pi}(a|s) = \mathbb{E}_{f \sim \tilde{p}(f) }  \left[ \varpi(a | s; \theta^f) \right ]$, where $\varpi(a | s; \theta)$ defines an action distribution by a (deterministic) neural network parameterized by $\theta$,  
and $L_\pi(\tilde{\pi})$ is the expected reward of $\tilde{\pi}$ evaluated on the trajectory generated by $\pi$. 
If $0 < \|\tilde{\pi}\|_1 < \infty$ and $0 < \|\pi\|_1 < \infty$, 
then the following bound holds 
\begin{equation}
\begin{aligned}
\eta(\tilde{\pi}) \geq L_\pi(\tilde{\pi}) - \frac{1}{1-\gamma}  \left( \mathcal{W} \left[\tilde{p}(f) \| p(f) \right] \right )^{2}
\end{aligned}
\label{monogua}
\end{equation}
where $\mathcal{W}^{\text{max}} \left[\tilde{p}(f) \| p(f) \right] = \max_s \mathcal{W} \left[\tilde{p}(f) \| p(f) \right]$ and the equality holds when $\tilde{p}(f) = p(f)$. Moreover, 
\begin{equation}
\begin{aligned}
\eta_{\mathcal{W}} \ge  \eta_{\mathcal{KL}} 
\end{aligned}
\label{relations}
\end{equation}
where $\eta_{\mathcal{W}}$ denotes the right-hand side of (\ref{monogua}) and $\eta_{\mathcal{KL}} = L_\pi(\tilde{\pi}) - \frac{1}{1-\gamma}  \mathcal{KL} \left[\tilde{p}(f) \| p(f) \right]$. 
\end{theorem}


\begin{proof}
We provide the proof in the Supplementary, where we use the relationship between total variation divergence, KL divergence and Wasserstein distance. 
\end{proof}

Theorem \ref{them:monogua} states that the optimization of the right-hand side (RHS) of (\ref{monogua}) can guarantee the monotonic improvement of the expected reward. 
% Remember that $L_{\pi_{\text{old}}}(\tilde{\pi})$ is just $\mathbb{E}_{q(f)}\left[ J(f) \right]$ using the policy ratio in (\ref{trpo}) as the $J(f)$ in  (\ref{fvpo}), so 
The RHS of (\ref{monogua}) corresponds to the objective function in (\ref{fwvpo}). The only difference is that the distance is between the prior and the posterior in (\ref{fwvpo}) but between two consecutive posteriors in (\ref{monogua}). We can understand (\ref{fwvpo}) as the initial constraint with no previous knowledge and (\ref{monogua}) as the continual constraint using updated knowledge about the posterior; equivalently, (\ref{fwvpo}) acts as a global constraint and (\ref{monogua}) as a local constraint. 
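
The reasoning behind the monotonic improvement follows the minorization-maximization argument used for TRPO \citep{schulman2015trust}, sketched here with the shorthand $M_\pi(\tilde{\pi})$ (introduced only for this illustration) for the RHS of (\ref{monogua}):
\[
M_\pi(\tilde{\pi}) = L_\pi(\tilde{\pi}) - \frac{1}{1-\gamma}  \left( \mathcal{W} \left[\tilde{p}(f) \| p(f) \right] \right )^{2} .
\]
By (\ref{monogua}), $\eta(\tilde{\pi}) \geq M_\pi(\tilde{\pi})$ for any new policy, and the equality case $\tilde{p}(f) = p(f)$ gives $\eta(\pi) = M_\pi(\pi)$. Subtracting the two relations yields
\[
\eta(\tilde{\pi}) - \eta(\pi) \geq M_\pi(\tilde{\pi}) - M_\pi(\pi),
\]
so any update that does not decrease $M_\pi$ relative to its value at the old policy cannot decrease the true expected reward $\eta$.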

Another point worth highlighting is that we hope to preserve the stochastic process properties of $q(f)$ (i.e., marginalization consistency according to the Kolmogorov extension theorem \citep{oksendal2003stochastic}) during the optimization, because marginalization consistency could greatly improve the generalization ability of the learned policy. However, approximating with function samples and finite measurement sets may damage such properties. 
To further encourage the marginalization consistency of $q(f)$, we propose to minimize the distance between the marginal $q_j(f(Y))$ of a joint distribution $q(f(Y, U))$, obtained using samples $(Y,U)$, and $q_m(f(Y))$, obtained using only samples $Y$, by
\begin{equation}
\begin{aligned}
&\mathcal{W}_Y \left[q_m(f(Y)), q_j(f(Y)) \right] =
% & = \max_{\|\phi\|_{L \leq 1}} \quad \left | \mathbb{E}_{q_m(f(Y))}\left[ \phi(f(Y))  \right] 
%  -  \mathbb{E}_{q_j(f(Y))}\left[ \phi(f(Y))  \right] \right |=
 \\
 & \max_{\|\phi\|_{L \leq 1}}  \left | \frac{1}{N_j}\sum_{i=1:N_j} \phi(f_{j,i}(Y))  - \frac{1}{N_m}\sum_{i=1:N_m} \phi(f_{m,i}(Y))  \right |
 \end{aligned}
 \label{margcons}
\end{equation} 
where $f_{j,i}$ and $f_{m,i}$ are both function samples from $q(f)$ but with no overlap, and we use subscripts to distinguish them. Since we have samples of the joint distribution $q(f(Y, U))$, it is easy to obtain its marginal on $Y$ by simply discarding $U$. Ideally, the above formula would be evaluated for all possible $Y$, but the number of combinations is too large, so we instead (uniformly) randomly sample several sets. 
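
For reference, the marginalization consistency targeted by (\ref{margcons}) can be stated as follows, in the notation above: for any finite sets $Y$ and $U$,
\[
q(f(Y)) = \int q(f(Y, U)) \,\mathrm{d} f(U) .
\]
If this holds exactly, the two sample sets in (\ref{margcons}) are draws from the same distribution on $f(Y)$, so $\mathcal{W}_Y$ tends to zero as $N_j$ and $N_m$ grow (assuming finite first moments). A value that stays clearly positive for large sample sizes therefore signals a violation of consistency introduced by the approximation.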

To sum up, we integrate the three terms gradually proposed above to obtain our final FWVPO,
\begin{equation}
\begin{aligned}
\max_{q} & \quad \mathbb{E}_{q(f)}\left[ J(f)  \right] 
- \alpha_1 \left(\mathcal{W} \left[{q}_{\text{old}}(f) \| q(f) \right] \right )^2
\\
&\quad\quad\quad -  \alpha_2 \left(\mathcal{W}[q(f) \| p_0(f)] \right )^2
 \\
&\quad\quad\quad - \alpha_3 \mathcal{W}_Y \left[q_m(f(Y)), q_j(f(Y))
 \right]
 \\
  \text{s.t.}& \quad H(q) - H(p_0) - \frac{1}{2\rho} \geq 0
\end{aligned}
\label{finalfwvpo}
\end{equation} 
where $\{\alpha_i > 0\}_{i=1}^{3}$ and $\rho>0$ are hyperparameters. The first regularizer corresponds to the monotonic improvement property from (\ref{monogua}); the second regularizer corresponds to the prior constraint from (\ref{eq:priorc}); the last enhances marginalization consistency through (\ref{margcons}). It is interesting to see that the prior can be considered a \textit{global} and \textit{static} constraint, while $q_{\text{old}}$ can be considered a \textit{local} and \textit{dynamic} constraint. 
In practice, we use a finite measurement set $S\in \mathcal{S}^{k}$ to evaluate the objective (\ref{finalfwvpo}) according to (\ref{wdist}): $\mathbb{E}_{q(f)}\left[ \phi(f(S))  \right] - \mathbb{E}_{p_0(f)}\left[ \phi(f(S))  \right]$, where $k$ is the size of the measurement set. 
% We firstly propose to randomly select measurement set (states) to cover the state space as much as possible and then approximate the function distribution as much as possible. 
At each training step, we first add the batch of training episodes in the local on-policy memory buffer to the measurement set and then randomly select an additional set of states from a global pool that stores all visited states. 
The core part is summarised in Algorithm 1 (see Supplementary). 
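
As a sketch of how one such term could be estimated from samples (the sample size $N$ and the symbols $\tilde{S}$, $f_i$, and $f'_i$ are used here only for illustration, with $\mathcal{B}$ the memory buffer and $\mathcal{M}$ the global pool, and $\phi$ standing for whichever Lipschitz function is currently in use), the measurement set and a plug-in estimate of the prior term would read
\[
S = \mathcal{B} \cup \tilde{S}, \qquad \tilde{S} \subset \mathcal{M} \ \text{drawn uniformly at random},
\]
\[
\mathbb{E}_{q(f)}\left[ \phi(f(S)) \right] - \mathbb{E}_{p_0(f)}\left[ \phi(f(S)) \right] \approx \frac{1}{N}\sum_{i=1:N} \phi(f_i(S)) - \frac{1}{N}\sum_{i=1:N} \phi(f'_i(S)), \quad f_i \sim q(f), \; f'_i \sim p_0(f).
\]
The same plug-in form would apply to the term involving $q_{\text{old}}(f)$, with samples from $q_{\text{old}}(f)$ in place of those from $p_0(f)$.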


\section{RELATED WORK}
\label{relatedwork}


Policy optimization algorithms can be roughly grouped along two axes, gradient usage and model usage: gradient-based methods \citep{williams1992simple,li2021softmax,drpg} versus gradient-free methods \citep{szita2006learning}, and model-based versus model-free methods. This paper focuses on model-free, gradient-based policy optimization only. Inspired by conservative policy iteration \citep{cpi}, TRPO \citep{schulman2015trust} and PPO \citep{schulman2017proximal} were proposed to generalize the idea to general stochastic policies and obtained state-of-the-art performance. Following these studies, several further ideas were proposed, such as a new clipping function to improve performance stability \citep{wang2020truly}, an additional Gaussian process estimate of the expected return given a policy parameter to encourage exploration \citep{rao2020gaussian}, and convergence analysis of policy optimization algorithms using mirror descent iteration and momentum techniques \citep{huang2021bregman}. 
Variational policy optimization is an interesting branch that borrows approximate Bayesian inference techniques to improve uncertainty modeling and generalization capability. One popular approximate Bayesian inference technique is variational inference, as in MEPO \citep{levine2018reinforcement,liu2017stein}, which naturally transforms probabilistic inference into an optimization problem with an additional KL divergence between the approximate posterior and the prior. Since the KL divergence is not symmetric, its reversed version \citep{neumann2011variational} was also used to force the policy to be `cost-averse' rather than `reward-attracted', at the cost of losing the original lower bound property. Apart from variational inference \citep{blei2017variational}, particle filtering was also used to develop the Stein variational policy gradient \citep{liu2017stein}, which directly minimizes the KL divergence between the optimal posterior distribution and the prior through a series of iteratively transformed approximate distributions. Another similar idea \citep{xu2018variational} used amortized variational inference to resolve the KL divergence through a general invertible transformation. 

Apart from the KL divergence, there are also studies that use the Wasserstein distance for policy optimization \citep{terpin2022trust}. This distance has been used to constrain the distance between the before and after transition probabilities \citep{abdullah2019wasserstein} and between the before and after return distributions \citep{li2021bayesian}. For policy distance regularization, the Wasserstein distance was used to measure the distance between behavioral policy embeddings, which encode global behaviors of a policy rather than local action selection \citep{pacchiano2020learning}. To facilitate policy optimization under Wasserstein regularization, one idea was to link policy optimization with the Wasserstein gradient flow and then estimate this flow with a particle approximation method \citep{zhang2018policy}; another idea was to use the second-order Taylor expansion of the Wasserstein distance to characterize the local behavioral structures \citep{moskovitz2020efficient}; the latest idea \citep{songefficient} was to derive a closed form of the policy update based on the Lagrangian of the constrained optimization. Apart from policy optimization, the Wasserstein distance was also used for other reinforcement learning tasks, such as reward function learning \citep{pmlr-v89-zhang19a}. Note that although the Wasserstein gradient flow used in \citep{pmlr-v89-zhang19a} can be considered a kind of functional optimization/sampling on the probability measure space, the samples obtained there still lie in the parameter space. Extending it to the function space requires specific designs \citep{wang2018function}. Most of these works operate in the parameter space rather than the function space considered in our work. 



\begin{figure*}[!t]
     \centering
     \begin{subfigure}[t]{0.22\textwidth}
         \centering
         \includegraphics[width=\textwidth]{aistats/figure/PPO_Hopper-v2_fig_4.pdf}
        % \caption{two-layer BNN}
         % \label{fig:acro2}
     \end{subfigure}
     % \hfill
     % \hspace{0.3cm}
     \begin{subfigure}[t]{0.22\textwidth}
         \centering
         \includegraphics[width=\textwidth]{aistats/figure/PPO_Humanoid-v2_fig_4.pdf}
         % \caption{one-layer BNN}
         % \label{fig:acro1}
     \end{subfigure}
     \begin{subfigure}[t]{0.22\textwidth}
         \centering
         \includegraphics[width=\textwidth]{aistats/figure/PPO_Walker2d-v2_fig_4.pdf}
         % \caption{one-layer BNN}
         % \label{fig:acro1}
     \end{subfigure}
     \begin{subfigure}[t]{0.22\textwidth}
         \centering
         \includegraphics[width=\textwidth]{aistats/figure/PPO_HalfCheetah-v2_fig_4.pdf}
         % \caption{one-layer BNN}
         % \label{fig:acro1}
     \end{subfigure}
\caption{Average episode rewards on four MuJoCo environments.}
\label{fig:mujoco}
\end{figure*}

\begin{figure*}[!t]
     \centering
     \begin{subfigure}[t]{0.22\textwidth}
         \centering
         \includegraphics[width=\textwidth]{aistats/figure/PPO_noise_Hopper-v2_fig_0.pdf}
        % \caption{two-layer BNN}
         % \label{fig:acro2}
     \end{subfigure}
     % \hfill
     % \hspace{0.3cm}
     \begin{subfigure}[t]{0.22\textwidth}
         \centering
         \includegraphics[width=\textwidth]{aistats/figure/PPO_noise_Humanoid-v2_fig_0.pdf}
         % \caption{one-layer BNN}
         % \label{fig:acro1}
     \end{subfigure}
     \begin{subfigure}[t]{0.22\textwidth}
         \centering
         \includegraphics[width=\textwidth]{aistats/figure/PPO_noise_Walker2d-v2_fig_0.pdf}
         % \caption{one-layer BNN}
         % \label{fig:acro1}
     \end{subfigure}
     \begin{subfigure}[t]{0.22\textwidth}
         \centering
         \includegraphics[width=\textwidth]{aistats/figure/PPO_noise_HalfCheetah-v2_fig_0.pdf}
         % \caption{one-layer BNN}
         % \label{fig:acro1}
     \end{subfigure}
\caption{Average episode rewards on four MuJoCo environments with noise.}
\label{fig:mujoco-noise}
\end{figure*}

\begin{figure*}[!t]
     \centering
     \begin{subfigure}[t]{0.22\textwidth}
         \centering
         \includegraphics[width=\textwidth]{aistats/figure/PPO_gen_Hopper-v2_fig_0.pdf}
        % \caption{two-layer BNN}
         % \label{fig:acro2}
     \end{subfigure}
     % \hfill
     % \hspace{0.3cm}
     \begin{subfigure}[t]{0.22\textwidth}
         \centering
         \includegraphics[width=\textwidth]{aistats/figure/PPO_gen_Humanoid-v2_fig_0.pdf}
         % \caption{one-layer BNN}
         % \label{fig:acro1}
     \end{subfigure}
     \begin{subfigure}[t]{0.22\textwidth}
         \centering
         \includegraphics[width=\textwidth]{aistats/figure/PPO_gen_Walker2d-v2_fig_0.pdf}
         % \caption{one-layer BNN}
         % \label{fig:acro1}
     \end{subfigure}
     \begin{subfigure}[t]{0.22\textwidth}
         \centering
         \includegraphics[width=\textwidth]{aistats/figure/PPO_gen_HalfCheetah-v2_fig_0.pdf}
         % \caption{one-layer BNN}
         % \label{fig:acro1}
     \end{subfigure}
\caption{Average episode rewards on four MuJoCo environments under four variations.}
\label{fig:mujoco-change}
\end{figure*}


\section{EXPERIMENTS}
\label{Experiments}

We designed our experiments to investigate the following questions: 1) How do different policy parameterizations and prior-posterior distance choices affect the performance of the algorithm? 2) What is the advantage of modeling uncertainty using a function distribution? Our code\footnote{https://github.com/JunyuXuan/FWVPO} is released for reference.


\begin{figure}[!t]
\centering
\includegraphics[width=0.45\textwidth]{aistats/figure/PPO_CartPole-v1_fig_0-cropped.pdf}
\caption{Average episode rewards from different algorithms on CartPole.}
\label{fig:cart3}
\end{figure}



\subsection{Basic setup}

We used \textbf{PPO} as the base model and its clipped objective term $J(f)$ for all comparative methods. We then implemented the policy optimization with BNN as the policy parameterization \textbf{BNN-PPO} and its extensions with KL divergence \textbf{BNN-KL-PPO} and Wasserstein distance \textbf{BNN-W-PPO} \citep{pacchiano2020learning,zhang2018policy} between action distributions, respectively. We also implemented the policy optimization with functional BNN as the policy parameterization and its extensions with KL divergence \textbf{fBNN-KL-PPO (FKVPO)} and Wasserstein distance \textbf{fBNN-W-PPO (FWVPO)} between function distributions, respectively. All algorithms are given the same hyperparameters whose details can be found in the Supplementary. 

\subsection{Effect of different policy parameterizations and prior-posterior distance choices}








The learning curves of the algorithms on classical gym environments are shown in Figures \ref{fig:cart3} and \ref{fig:mujoco}, where the x-axis denotes time steps and the y-axis denotes the average episode reward. From Figure \ref{fig:cart3}, we can see that PPO converged quickly after a small number of steps but became unstable after 2e6 training steps. A possible reason is that PPO fell into a local optimum, so its performance dropped once it explored new regions of the state space. Compared with PPO, BNN-PPO converged more slowly in Figure \ref{fig:cart3}. The reason is that the BNN-parameterized policy learns a distribution over functions rather than the single function represented by the deterministic neural network in PPO, which apparently needs more samples. In the more complex MuJoCo environments, its convergence rate is not worse than that of PPO, as shown in Figure \ref{fig:mujoco}. The merits of such distribution learning are demonstrated in the following sections. We observed that BNN-W-PPO was better than both of them, which we attribute to the Wasserstein distance, as expected. The two functional BNN-based algorithms achieved better performance than all parametric BNN-based ones, with fBNN-W-PPO slightly better than fBNN-KL-PPO. A likely reason for the small gap is that we used the grid KL divergence \citep{ma2021functional} between function distributions, which was proven to be bounded. 


\subsection{Robustness to noisy observations}



\begin{figure}[!t]
     \centering
     \begin{subfigure}[t]{0.22\textwidth}
         \centering
         \includegraphics[width=\textwidth]{aistats/figure/PPO_noise_CartPole-v1_fig_0-cropped.pdf}
         \caption{CartPole with noise}
         \label{fig:cart}
     \end{subfigure}
     % \hfill
     % \hspace{0.3cm}
     \begin{subfigure}[t]{0.22\textwidth}
         \centering
         \includegraphics[width=\textwidth]{aistats/figure/PPO_noise_Acrobot-v1_fig_0-cropped.pdf}
         \caption{Acrobot with noise}
         \label{fig:acro}
     \end{subfigure}
\caption{Rewards on two noisy environments}
\label{fig:noise}
\end{figure}


One merit of uncertainty modeling\footnote{There are some works on decomposing aleatoric and epistemic uncertainties for specific tasks, but we did not decompose the two kinds of uncertainty and only focused on the general and basic policy optimization task.} is the ability to handle noise. To verify this, we injected (multivariate Gaussian distributed) random noise into the environments (the details of the setup can be found in the Supplementary). The results of PPO and FWVPO on noisy environments are given in Figures \ref{fig:noise} and \ref{fig:mujoco-noise}. We can see that the rewards of PPO dropped dramatically, from -100 (noise-free Acrobot) to -300 (noisy Acrobot) and from 500 (noise-free CartPole) to 60 (noisy CartPole), as shown in Figure \ref{fig:noise}. 
On the MuJoCo environments, FWVPO also obtained consistently higher rewards than PPO under the same noise in all four environments, as shown in Figure \ref{fig:mujoco-noise}.
FWVPO not only converged to a higher reward than PPO but also converged faster on Acrobot and at a comparable rate on CartPole and MuJoCo. 
We highlight that, unlike weakly supervised learning studies \citep{zhou2018brief}, FWVPO contains no component or strategy specifically designed for noise. This ability comes entirely from its uncertainty modeling. 


\subsection{Generalization to environment variations}


\begin{figure}[!t]
% \centering
\begin{minipage}[t]{.45\textwidth}
  \centering
  % \includegraphics[width=\textwidth]{figure/PPO_gen_CartPole-v1_fig_0-cropped.pdf}
  \includegraphics[width=\textwidth]{aistats/figure/Presentation2.png}
         \caption{Evaluation on environment variations}
         \label{fig:gen}
\end{minipage}%
\hspace{0.5cm}
\begin{minipage}[t]{.45\textwidth}
  \centering
  % \includegraphics[width=\textwidth]{figure/PPO_CartPole-v1_fig_2bnn_fwppo_all_0-cropped.pdf}
  \includegraphics[width=\textwidth]{aistats/figure/Presentation1.png}
         \caption{Contributions from three different terms}
         \label{fig:ablation}
\end{minipage}
\end{figure}

The other merit of uncertainty modeling is the ability to generalize to environmental variations. To verify this, we tested the pre-trained algorithms on varied environments without further training. 
The average results over ten episodes on CartPole are shown in Figure \ref{fig:gen}, where the x-axis denotes the five modified environments with increasing degrees of change relative to the basic CartPole (a larger number means a larger change), and the y-axis denotes the average reward over 10 episodes, with one standard deviation shaded. More results on MuJoCo environments are given in Figure \ref{fig:mujoco-change} with four variations. Please see the Supplementary for detailed explanations of the designed variants. From Figure \ref{fig:gen}, we can observe that the rewards of PPO dropped dramatically from 500 (original environment) to around 300 already at the first variation, which was the smallest change. As the variation increased, the performance kept dropping to a very low level and, surprisingly, the standard deviation decreased as well. The shaded region reflects the confidence of the algorithm in its prediction/performance, so the small shaded regions around the 4th and 5th environments indicate that PPO failed on them without being aware of its failure. In contrast, FWVPO remained stable and obtained high rewards in the varied environments. There was only a relatively small decrease from the 4th environment onward, and a large variance was correctly assigned to this decrease, which could support subsequent safe decision making. From Figure \ref{fig:mujoco-change}, we can also see that FWVPO consistently outperformed PPO on all four environments across all variants. 
This fragility of PPO strongly motivates the move to variational policy optimization.


\subsection{Ablation study}


We further studied the contributions of the three terms in (\ref{finalfwvpo}): distance to the prior, distance to the old posterior, and distance between marginal distributions. The ablation results are shown in Figure \ref{fig:ablation}. First, we can see that removing either the distance to the old posterior or the distance between marginal distributions decreased the performance, and that the distance to the old posterior was slightly more important than the marginal one. We also see that removing the distance to the prior did not decrease but slightly improved the performance. The reason is that we used a non-informative prior in the experiments for simplicity. In practice, however, a more meaningful prior could be provided by pretraining it on randomly collected interactions or by encoding other prior knowledge of the policy or environment. More parameter analysis can be found in the Supplementary.  

\section{CONCLUSIONS AND FUTURE STUDIES}
\label{Conclusions}

In this paper, we proposed functional Wasserstein variational policy optimization (FWVPO) for reinforcement learning, based on the 1-Wasserstein distance instead of the KL divergence or the 2-Wasserstein distance. The new algorithm was empirically shown to have good uncertainty modeling and generalization ability. We proved that FWVPO is a valid and tighter variational Bayesian objective and that it guarantees monotonic expected reward improvement under certain conditions. Comparative experiments with several baselines on benchmark reinforcement learning tasks verified the proposed idea. In the future, we are going to further evaluate the proposed idea on model-based RL, where functional BNNs would be used as the environment model \citep{lee2018bayesian} to increase the uncertainty modeling capability. Another interesting direction is to investigate whether the `probability density' of a function distribution can be properly expressed. 


% \begin{contributions} % will be removed in pdf for initial submission 
% 					  % (without ‘accepted’ option in \documentclass)
%                       % so you can already fill it to test with the
%                       % ‘accepted’ class option
%     Briefly list author contributions. 
%     This is a nice way of making clear who did what and to give proper credit.
%     This section is optional.

%     H.~Q.~Bovik conceived the idea and wrote the paper.
%     Coauthor One created the code.
%     Coauthor Two created the figures.
% \end{contributions}

\begin{acknowledgements} % will be removed in pdf for initial submission,
This work is supported by the Australian Research Council under the Discovery Early Career Researcher Award DE200100245 and Laureate Fellowships FL190100149.
\end{acknowledgements}

% References
% \bibliography{uai2024-template}
\bibliography{references}


\newpage

\onecolumn

\title{Functional Wasserstein Variational Policy Optimization\\(Supplementary Material)}
\maketitle

\appendix
\section{Setting for Figure 1 in the paper}

We first define a Gaussian mixture model (GMM) as our target distribution,
\begin{equation}
    p(x) = 0.1 * \mathcal{N}\left(x; \begin{bmatrix}
0 \\
0 
\end{bmatrix}, 
\begin{bmatrix}
2 & 0 \\
0 & 2 \\
\end{bmatrix} \right ) + 
0.2 * \mathcal{N}\left(x; \begin{bmatrix}
20 \\
20 
\end{bmatrix}, 
\begin{bmatrix}
3 & 0 \\
0 & 3 \\
\end{bmatrix} \right ) +
0.7 * \mathcal{N}\left(x; \begin{bmatrix}
-10 \\
20 
\end{bmatrix}, 
\begin{bmatrix}
1 & 0 \\
0 & 1.5 \\
\end{bmatrix} \right )
\end{equation}
where three components are included with corresponding weights. The log-likelihood contour field is plotted in Figure \ref{fig:klw}. We then use a Gaussian distribution 
\begin{equation}
    q(x) = 0.1 * \mathcal{N}\left(x; \mu, 
\begin{bmatrix}
5 & 0 \\
0 & 5 \\
\end{bmatrix} \right )
\end{equation}
to approximate the above-defined GMM distribution, where $\mu$ is the mean parameter to be optimized. Finally, we use the KL divergence ($\mathcal{KL}[q \| p]$) and the Wasserstein distance ($W_1[q, p]$) as the loss function to optimize $\mu$, respectively. All other settings, such as the optimizer, the number of steps, and the learning rate, are the same for both losses. 


The results are shown in Figure \ref{fig:klw}, where we set two different initializations (Figures \ref{fig:prior} and \ref{fig:prior1}). We can see that 
\begin{itemize}
    \item KL divergence is sensitive to the initialization. For different initializations, there are two different results (Figures \ref{fig:kl} and \ref{fig:kl1}) from KL divergence. In contrast, the results from Wasserstein distance (Figures \ref{fig:wass} and \ref{fig:wass1}) are the same under different initializations. 
    \item KL divergence tends to converge to the local optimal mode, which is consistent with the findings in \citep{neumann2011variational}. 
    \item Wasserstein distance could escape the local optimum and move close to the globally optimal mode (the upper-left one with the darkest color in Figure \ref{fig:klw}). 
\end{itemize}


\section{Notation table}

\begin{table}[h]
  \caption{Notation table}
  \label{Notation}
  \centering
  \begin{tabular}{lp{13cm}}
    \toprule
    Notation     & Description     \\
    \midrule
    $p_0(f)$ & functional prior distribution       \\
    $q(f)$ & functional variational posterior  distribution     \\
    $\theta$ & policy parameters, e.g., neural network weights for deterministic policy parameterization\\
    $\vartheta$ &  neural network weights' distribution parameters for BNN policy parameterization \\
    $\vartheta^f$ &  the neural network weights' distribution parameters for a BNN that induces a functional distribution $q(f)$  \\
    $S$ & measurement set \\
    \bottomrule
  \end{tabular}
\end{table}

In Table \ref{Notation}, we list the notations used in the paper together with their descriptions. 


\section{Some standard concepts of reinforcement learning}

The definitions of standard concepts follow those in TRPO \citep{schulman2015trust}, including the state-action value function 
\[ Q_{\pi}(s_t, a_t) = \mathbb{E}_{s_{t+1}, a_{t+1}, \ldots} \left [ \sum_{l=0}^{\infty} \gamma^l r(s_{t+l}) \right ], \]
the state value function 
\[ V_{\pi}(s_t) = \mathbb{E}_{a_t, s_{t+1},\ldots} \left [ \sum_{l=0}^{\infty} \gamma^l r(s_{t+l}) \right ],\]
and the advantage function 
\[A_{\pi}(s, a) = Q_{\pi}(s, a) - V_{\pi}(s).\]
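
A consequence of these definitions that is useful when reading the surrogate objectives (stated here as a standard identity, using that $V_{\pi}(s) = \mathbb{E}_{a \sim \pi(\cdot|s)}\left[ Q_{\pi}(s, a) \right]$ under the definitions above) is that the advantage has zero mean under the policy that defines it:
\[
\mathbb{E}_{a \sim \pi(\cdot|s)}\left[ A_{\pi}(s, a) \right] = \mathbb{E}_{a \sim \pi(\cdot|s)}\left[ Q_{\pi}(s, a) \right] - V_{\pi}(s) = 0 .
\]
The advantage $A_{\pi}$ therefore only carries a learning signal when actions are drawn from a policy other than $\pi$.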

\section{More details about different policy parameterizations}

\begin{itemize}
    \item \textbf{Policy parameterized by a deterministic neural network.} The policy is traditionally parameterized by a deterministic deep neural network in which the final layer outputs parameters of an (action) distribution 
    \begin{equation}
    \begin{aligned}
    \varpi(a|s; \theta)
    \end{aligned}
    \end{equation} 
    where $\theta$ denotes the neural network weights. 
    \item \textbf{Policy parameterized by a BNN in weight space.} A BNN was used to replace the deterministic deep neural network \citep{levine2018reinforcement,liu2017stein}, 
        \begin{equation}
    \begin{aligned}
    \pi(a|s)=\mathbb{E}_{p(\theta|\vartheta)}\left[ \varpi(a|s; \theta)\right]
    \end{aligned}
    \end{equation} 
    where $p(\theta|\vartheta)$ is the distribution of neural network weights, such as commonly used i.i.d. Gaussian distributions.  
     \item \textbf{Policy parameterized by a BNN in function space.} We use BNN as the policy representation but work in the function space rather than the weight space,
         \begin{equation}
    \begin{aligned}
    \pi(a|s)=\mathbb{E}_{p(f)}\left[ \varpi(a|f(s))\right]=\mathbb{E}_{p(\theta^f | \vartheta^f)}\left[ \varpi(a|s; \theta^f)\right]
        \end{aligned}
    \end{equation} 
     where $p(f)$ is a functional distribution induced by a parameterized BNN with $p(\theta^f | \vartheta^f)$. Since it is hard to represent a function drawn from a BNN explicitly, it is common to use the BNN weights to represent it, because there is a mapping between $f$ and $\theta^f$ and hence a mapping between $p(f)$ and $p(\theta^f | \vartheta^f)$. In a nutshell, $p(f)$ can be simply understood as a BNN whose weights follow a distribution parameterized by $\vartheta^f$. It is important to note that although the evaluation of $\mathbb{E}_{p(f)}\left[ \varpi(a|f(s))\right]$ and $\mathbb{E}_{p(\theta^f | \vartheta^f)}\left[ \varpi(a|s; \theta^f)\right]$ looks like a policy parameterization in weight space, it differs from weight-space parameterization in terms of regularization, such as the KL divergence of (7) or the Wasserstein distance of (8) in the paper. Similarly, the evaluation of $\mathbb{E}_{q(f)}\left[ J(f)  \right]$ in (7) and (8) in the paper also uses the inducing distribution,
              \begin{equation}
    \begin{aligned}
    \mathbb{E}_{q(f)}\left[ J(f) \right] = \mathbb{E}_{q(\theta^f | \vartheta^f)}\left[ J(\theta^f)  \right]
            \end{aligned}
    \end{equation} 
     where $J(\theta^f)$ can be $J^{\text{TRPO}}(\theta^f)$ or $J^{\text{PPO}}(\theta^f)$. 
\end{itemize}
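
As an illustrative sketch of how the function-space policy in the last item could be evaluated (the sample count $N$ and the weight samples $\theta^f_i$ are notation introduced here, not taken from the paper), the expectation may be approximated by Monte Carlo over BNN weights:
\[
\pi(a|s) \approx \frac{1}{N}\sum_{i=1:N} \varpi(a|s; \theta^f_i), \qquad \theta^f_i \sim p(\theta^f | \vartheta^f).
\]
For a fixed measurement set $S$, each draw $\theta^f_i$ also yields one function sample $f_i(S)$, which is the form in which samples enter the distance terms.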

\begin{algorithm}[!t]
\caption{FWVPO}
\label{alg}
\begin{algorithmic}[1]
    \STATE \textbf{Require} pool $\mathcal{M}$ and memory buffer $\mathcal{B}$
    % \STATE Compute advantage $A(s, a)$ using $\mathcal{B}$ \\
    \STATE Initialize a GP prior $G_0$ and a BNN parameterized by $\vartheta$ \\
    \STATE Initialize three Lipschitz functions $\phi_{\varphi'}, \phi_{\tilde{\varphi}}, \phi_{\hat{\varphi}}$ parameterized by $\varphi', \tilde{\varphi}, \hat{\varphi}$, respectively
    \FOR{ $t=0,1, \ldots$}
        \STATE{ Draw a measurement set $S$ from the pool $\mathcal{M}$}
        \STATE{ Combine $S = \{\mathcal{B}, S\}$ }
        \STATE{ Draw $N$ functions from $G_0$ on $S$, $ \{f'_i(S)\}_{i=1:N} $}
        \STATE{ Draw $N$ functions from ${q}_{\text{old}}(f)$ on $S$, $ \{\tilde{f}_i(S)\}_{i=1:N} $}
        \STATE{ Draw $N+M$ functions from $q_\vartheta(f)$ on $S$, $ \{{f}_i(S)\}_{i=1:N} $ and $ \{{f}_j(S)\}_{j=1:M} $} 
        \STATE{ Update $\varphi'$ by 
        \[\argmax_{\varphi'} \left | \frac{1}{N}\sum_{i=1:N} \phi_{\varphi'}(f_i(S))
 -  \frac{1}{N}\sum_{i=1:N} \phi_{\varphi'}(f'_i(S))  \right |, ~\text{s.t.}~  \|\phi_{\varphi'}\|_{L\leq1} \] }
        \STATE{ Update $\tilde{\varphi}$ by 
        \[\argmax_{\tilde{\varphi}} \left | \frac{1}{N}\sum_{i=1:N} \phi_{\tilde{\varphi}}(f_i(S))
 -  \frac{1}{N}\sum_{i=1:N} \phi_{\tilde{\varphi}}(\tilde{f}_i(S))  \right |, ~\text{s.t.}~  \|\phi_{\tilde{\varphi}}\|_{L\leq1} \] }
 \STATE{ Update $\hat{\varphi}$ by 
        \[\argmax_{\hat{\varphi}} \left | \frac{1}{N}\sum_{i=1:N} \phi_{\hat{\varphi}}(f_i(S))
 -  \frac{1}{M}\sum_{j=1:M} \phi_{\hat{\varphi}}({f}_j(S))  \right |, ~\text{s.t.}~  \|\phi_{\hat{\varphi}}\|_{L\leq1} \] }
        \STATE{ Update $\vartheta$ by (11) in the paper
        }
    \ENDFOR
\end{algorithmic}
\end{algorithm}


The pseudo-code of the whole procedure is briefly summarised in Algorithm 1.



\section{Proof for Theorem 1 in the paper}

 
\begin{proof}

We prove the first inequality as below. 

\begin{equation}
\begin{aligned}
&\mathbb{E}_{q(f^S)}\left[ J(f^S)  \right] - \frac{\rho}{2} \mathcal{W}^2[q(f^S) \| p_0(f^S)] 
\\
=&\mathbb{E}_{q(f^S)}\left[ J(f^S)  \right] - \mathcal{KL}[q(f^S) \| p_0(f^S)] + \mathcal{KL}[q(f^S) \| p_0(f^S)] - \frac{\rho}{2} \mathcal{W}^2[q(f^S) \| p_0(f^S)] 
\\
=&\log p(D) - \mathcal{KL}[q(f^S) \| p(f^S |D)] + \mathcal{KL}[q(f^S) \| p_0(f^S)] -  \frac{\rho}{2} \mathcal{W}^2[q(f^S) \| p_0(f^S)] 
\\
=&\log p(D) - \left ( \mathcal{KL}[q(f^S) \| p(f^S |D)] - \mathcal{KL}[q(f^S) \| p_0(f^S)] + \frac{\rho}{2} \left ( \max_{\|\phi\|_{L \leq 1}} \left \{ \mathbb{E}_{q(f^S)}[\phi(f^S)] - \mathbb{E}_{p_0(f^S)}[\phi(f^S)] \right \} \right )^2 \right )
\end{aligned}
\end{equation}
Next, we only need to show that the term in parentheses is nonnegative. If $-\log p_0(f^S)$ is an admissible choice of $\phi(f^S)$, we have 
\begin{equation}
\begin{aligned}
&\mathcal{KL}[q(f^S) \| p(f^S |D)] - \mathcal{KL}[q(f^S) \| p_0(f^S)] +  
\frac{\rho}{2} \left ( \max_{\|\phi\|_{L \leq 1}} \left \{ \mathbb{E}_{q(f^S)}[\phi(f^S)] - \mathbb{E}_{p_0(f^S)}[\phi(f^S)] \right \} \right )^2
\\
\geq & - \mathbb{E}_{q(f^S)}[\log p(f^S|D)] + \mathbb{E}_{q(f^S)}[\log p_0(f^S)] 
+ \frac{\rho}{2} \left (\mathbb{E}_{q(f^S)}[\log p_0(f^S)] - \mathbb{E}_{p_0(f^S)}[\log p_0(f^S)] \right )^2
\\
= &  \frac{\rho}{2} \left ( \mathbb{E}_{q(f^S)}[\log p_0(f^S)] 
- \mathbb{E}_{p_0(f^S)}[\log p_0(f^S)] + \frac{1}{\rho} \right )^2
+ \mathbb{E}_{p_0(f^S)}[\log p_0(f^S)] - \mathbb{E}_{q(f^S)}[\log p(f^S|D)] - \frac{1}{2\rho}
\\
\geq & \frac{\rho}{2} \left ( \mathbb{E}_{q(f^S)}[\log p_0(f^S)] 
- \mathbb{E}_{p_0(f^S)}[\log p_0(f^S)] + \frac{1}{\rho} \right )^2
+ H(q) - H(p_0) - \frac{1}{2\rho}
\\
\geq & 0 
\end{aligned}
\end{equation} 
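
The equality in the chain above follows by completing the square. As a sketch, write $x = \mathbb{E}_{q(f^S)}[\log p_0(f^S)] - \mathbb{E}_{p_0(f^S)}[\log p_0(f^S)]$ (a shorthand introduced here only for readability). Then
\[
\frac{\rho}{2} x^2 + x = \frac{\rho}{2}\left( x + \frac{1}{\rho} \right)^2 - \frac{1}{2\rho},
\]
so that
\[
\frac{\rho}{2} x^2 + \mathbb{E}_{q(f^S)}[\log p_0(f^S)] = \frac{\rho}{2}\left( x + \frac{1}{\rho} \right)^2 + \mathbb{E}_{p_0(f^S)}[\log p_0(f^S)] - \frac{1}{2\rho}.
\]
The subsequent inequality uses $\mathbb{E}_{p_0(f^S)}[\log p_0(f^S)] = -H(p_0)$ together with $-\mathbb{E}_{q(f^S)}[\log p(f^S|D)] \geq H(q)$ (a cross-entropy is never smaller than the entropy), and the final one uses the nonnegativity of the square together with the constraint $H(q) - H(p_0) - \frac{1}{2\rho} \geq 0$ in (\ref{finalfwvpo}).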

The second inequality follows from the Talagrand inequality (see Definition 1 of \citep{otto2000generalization}, which in turn stems from Theorem 1.1 of \citep{talagrand1996transportation}; note that when the cost function is not squared, the inequality should be Eq.~(1.2) of \citep{talagrand1996transportation}): a probability measure $q$ satisfies a Talagrand inequality with constant $\rho>0$ if, for all probability measures $p$ that are absolutely continuous w.r.t. $q$ and have finite moments of order 2,
\begin{equation}
    W_1(p, q) \leq \sqrt{\frac{2 \mathcal{KL}[p\|q]}{\rho}}. 
\end{equation}
Then, we can easily see that $\mathcal{L}^{\mathcal{W}} \geq \mathcal{L}^{\mathcal{KL}}$. 
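
Spelling out this last step as a sketch, and assuming that $\mathcal{L}^{\mathcal{W}}$ and $\mathcal{L}^{\mathcal{KL}}$ denote the Wasserstein-regularized and KL-regularized objectives on $f^S$ and that the prior $p_0(f^S)$ is the measure satisfying the Talagrand inequality: squaring the inequality with $p = q(f^S)$ gives
\[
\frac{\rho}{2}\, \mathcal{W}^2[q(f^S) \| p_0(f^S)] \leq \mathcal{KL}[q(f^S) \| p_0(f^S)],
\]
and therefore
\[
\mathbb{E}_{q(f^S)}\left[ J(f^S) \right] - \frac{\rho}{2}\, \mathcal{W}^2[q(f^S) \| p_0(f^S)] \geq \mathbb{E}_{q(f^S)}\left[ J(f^S) \right] - \mathcal{KL}[q(f^S) \| p_0(f^S)].
\]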


\end{proof}


\section{Proof for Theorem 2 in the paper}



\begin{proof}
Following TRPO \citep{schulman2015trust}, we define
\begin{equation}
\begin{aligned}
\eta(\tilde{\pi}) &= \eta(\pi) + \mathbb{E}_{\tau \sim \tilde{\pi} }\left[\sum_{t=0}^\infty \gamma^t A_\pi (s_t, a_t) \right ]
\\
 L_\pi(\tilde{\pi}) &= \eta(\pi) + \mathbb{E}_{s_t \sim \pi,\, a_t \sim \tilde{\pi}} \left [\sum_{t=0}^\infty \gamma^t A_\pi (s_t, a_t) \right ]
\end{aligned}
\end{equation} 
and we know that
\begin{theorem}[Theorem 2.1 in \citep{CHAE2020108771}]
\[ \left ( TV(\bar{\pi}, \pi) \right )^{(\beta + 1)/\beta} \leq c\left(\|\bar{\pi}\|_{H_1^\beta} + \|\pi\|_{H_1^\beta} \right)^{1/\beta} \mathcal{W}(\bar{\pi}, \pi)\]
\end{theorem}
where $c > 0$ is a constant, $\beta \in \mathbb{N}$ does not depend on the two distributions, and $\|f\|_{H_1^\beta} = \|f\|_1 + \|\nabla^{\beta}f\|_1$. 

According to the above theorem, as $\beta \to \infty$, we have 
$TV(\bar{\pi}, \pi) \leq c \mathcal{W}(\bar{\pi}, \pi)$. Then, 
\begin{equation}
\begin{aligned}
\eta(\pi_{new}) &\geq L_{\pi_{old}}(\pi_{new}) - \frac{4\gamma \epsilon}{(1-\gamma)^2} \left(TV^{\text{max}} (\pi_{old} \| \pi_{new}) \right)^2
\\
&\geq L_{\pi_{old}}(\pi_{new}) - \frac{4\gamma \epsilon}{(1-\gamma)^2} \lambda(1) \left(\mathcal{W}^{\text{max}}(\pi_{old} \| \pi_{new}) \right)^2
\\
&= L_{\pi_{old}}(\pi_{new}) - \frac{4\gamma \epsilon}{(1-\gamma)^2} \lambda(1) \left( \sup_\phi \left ( \mathbb{E}_{f \sim \tilde{p}(f)} \mathbb{E}_{a \sim \varpi(a;\theta^f) } \left[ \phi (a) \right] - \mathbb{E}_{f \sim p(f)} \mathbb{E}_{a \sim \varpi(a;\theta^f) }  \left[ \phi (a) \right] \right)\right)^2
\end{aligned}
\end{equation} 
where $\epsilon = \max_{s,a}|A_{\pi}(s,a)|$ and $\lambda(1) = c^2$, where $1$ denotes $\beta=1$; we ignore the coefficient without loss of generality because it can be easily adjusted to match. 
Since $\phi$ can be any Lipschitz function, we next assume that the maximizer of $\mathbb{E}_{f \sim \tilde{p}(f)} \mathbb{E}_{a \sim \varpi(a;\theta^f) } \left[ \phi (a) \right] - \mathbb{E}_{f \sim p(f)} \mathbb{E}_{a \sim \varpi(a;\theta^f) }  \left[ \phi (a) \right]$ is $\phi^*$ and that $\mathbb{E}_{a \sim \varpi(a;\theta^f) } \left[ \phi^* (a) \right]$ is also a Lipschitz function of $f$. Then, we have
\begin{equation}
\begin{aligned}
&L_{\pi_{old}}(\pi_{new}) - \frac{4\gamma \epsilon}{(1-\gamma)^2} \lambda(1) \left ( \sup_\phi \left ( \mathbb{E}_{f \sim \tilde{p}(f)} \mathbb{E}_{a \sim \varpi(a;\theta^f) } \left[ \phi (a) \right] - \mathbb{E}_{f \sim p(f)} \mathbb{E}_{a \sim \varpi(a;\theta^f)  }  \left[ \phi (a) \right] \right) \right)^2
\\
\geq & L_{\pi_{old}}(\pi_{new}) - \frac{4\gamma \epsilon}{(1-\gamma)^2} \lambda(1) \left ( \mathcal{W}^{\text{max}} \left[\tilde{p}(f), p(f) \right] \right)^2
\end{aligned}
\label{eq25}
\end{equation}
which proves the Theorem with only a difference in the coefficient of the Wasserstein distance term. We can easily absorb $4\gamma \epsilon$ into $\phi$ function definition and then obtain the same results.  

Note that neither $p(f)$ nor $\tilde{p}(f)$ depends on $s$, but $\mathcal{W}$ does because $\phi(s)=\mathbb{E}_{\varpi(a | s; \theta^f) } [ A_\varpi (s, a) ]$ depends on $s$. The `max' in (\ref{eq25}) can be removed: we assumed that all $\phi(s)$ are Lipschitz functions, so $\mathcal{W}$ is already the supremum of the distances defined by all candidate $\phi(s_t)$ for all $t$. We kept it here to ease the comparison with TRPO. 

For the second half of the Theorem (the relationship with the KL divergence), we use the Talagrand inequality \citep{otto2000generalization} again: a probability measure $q$ satisfies a Talagrand inequality with constant $\rho$ if, for all probability measures $p$ that are absolutely continuous w.r.t. $q$ and have finite moments of order 2,
\begin{equation}
    W_1(p, q) \leq \sqrt{\frac{2 \mathcal{KL}[p\|q]}{\rho}}. 
\end{equation}
Then, we can easily see that 
\begin{equation} 
\begin{aligned}
\eta_{KL} =& L_{\pi_{\text{old}}}(\pi_{\text{new}}) - \frac{4\gamma \epsilon}{(1-\gamma)^2} \mathcal{KL}[\pi_{\text{old}} \| \pi_{\text{new}}]
\\
\leq & L_{\pi_{\text{old}}}(\pi_{\text{new}}) - \frac{4\gamma \epsilon}{(1-\gamma)^2} \frac{\rho}{2}\left ( \mathcal{W}[p_\text{old}(f) \| p_\text{new}(f)] \right )^2
\\
=& \eta_{\mathcal{W}}.
\end{aligned}
\end{equation}
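To make the last step explicit, here is a short sketch. It assumes that the Talagrand inequality above applies, with constant $\rho$, to the pair of distributions compared in the display. Both sides of the inequality are non-negative, so squaring it gives
\begin{equation*}
W_1(p, q)^2 \leq \frac{2\,\mathcal{KL}[p\|q]}{\rho}
\quad\Longleftrightarrow\quad
\mathcal{KL}[p\|q] \geq \frac{\rho}{2}\, W_1(p, q)^2 .
\end{equation*}
Multiplying by the non-positive factor $-\frac{4\gamma \epsilon}{(1-\gamma)^2}$ reverses the inequality,
\begin{equation*}
- \frac{4\gamma \epsilon}{(1-\gamma)^2}\, \mathcal{KL}[p\|q] \leq - \frac{4\gamma \epsilon}{(1-\gamma)^2}\, \frac{\rho}{2}\, W_1(p, q)^2 ,
\end{equation*}
and adding $L_{\pi_{\text{old}}}(\pi_{\text{new}})$ to both sides yields $\eta_{KL} \leq \eta_{\mathcal{W}}$.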





\section{More details for the experiments in the paper}


\subsection{Setup details}


\begin{table}[!t]
  \caption{Hyperparameters for CartPole and Acrobot}
  \label{hyper-table}
  \centering
  \begin{tabular}{ll}
    \toprule
    \textbf{Name}     & \textbf{Value}    \\
    \midrule
    max time steps in one episode & 500    \\
    update policy frequency     & 2,000  (unless otherwise specified)    \\
    number of epochs for policy update   & 80 \\
    number of steps for Lipschitz function maximization  & 10 \\
    clip parameter for PPO   & 0.2 \\
    discount factor $\gamma$   & 0.99 \\
    activation function   & Tanh \\
    learning rate for actor network & 0.0003 \\
    learning rate for critic network & 0.001 \\
    learning rate for Lipschitz function & 0.01 \\
    random seed & 12 (unless otherwise specified)  \\
    \bottomrule
  \end{tabular}
\end{table}


\begin{table}[!t]
  \caption{Hyperparameters for MuJoCo experiments}
  \label{hyper-table-mujoco}
  \centering
  \begin{tabular}{ll}
    \toprule
    \textbf{Name}     & \textbf{Value}    \\
    \midrule
    max time steps in one episode & 2048    \\
    update policy frequency     & 5 episodes     \\
    number of epochs for policy update   & 10 \\
    number of steps for Lipschitz function maximization  & 10 \\
    clip parameter for PPO   & 0.2 \\
    discount factor $\gamma$   & 0.99 \\
    activation function   & Tanh \\
    learning rate for actor network & 0.0003 \\
    learning rate for critic network & 0.0003 \\
    learning rate for Lipschitz function & 0.01 \\
    random seed & 12  \\
    \bottomrule
  \end{tabular}
\end{table}

The evaluation environments were from Gym\footnote{\url{https://www.gymlibrary.ml/}}. PPO was used as the base model \citep{pytorch_minimal_ppo}, consisting of an actor network and a critic network. The network architecture for the actor was Linear(input, 64)-Identity(64)-Tanh-Linear(64, 64)-Identity(64)-Tanh-Linear(64, output), with a Softmax added for discrete actions; the architecture for the critic network was Linear(input, 64)-Tanh-Linear(64, 64)-Tanh-Linear(64, 1). All algorithms shared exactly the same critic network. The basic actor network was also the same, except that the BNN-based algorithms placed priors over the network parameters. The hyperparameters used are given in Table \ref{hyper-table}. Apart from the basic control environments, we also tested our proposed algorithm on MuJoCo benchmarks\footnote{\url{https://www.gymlibrary.dev/environments/mujoco/index.html}}, including Hopper and Humanoid. Here, we used a two-layer BNN in FWVPO; the corresponding hyperparameters are given in Table \ref{hyper-table-mujoco}. 

The 2-Wasserstein distance (also known as the Fréchet distance \citep{DOWSON1982450}) between two Gaussian distributions used by \textbf{BNN-W-PPO} was evaluated as
\[\mathcal{W}_2^{2}=\|\mu _{X}-\mu _{Y}\|^{2}+\operatorname{tr}\left(\Sigma _{X}+\Sigma _{Y}-2(\Sigma _{X}\Sigma _{Y})^{1/2}\right), \]
and the KL divergence between function samples used by \textbf{fBNN-KL-PPO} was evaluated using the grid KL in \citep{ma2021functional}: a geometric distribution was first used to sample a measurement set size, that number of observations was then uniformly sampled from the buffer, and the spectral Stein gradient estimator \citep{shi2018spectral} was used to estimate the KL divergence between the marginal distributions on the measurement set. FWVPO collected a set of states in memory before training and used it as the random global measurement set; the collection was implemented by letting a random policy interact with the environment. The Wasserstein distance optimization is based on the code from \citep{Tran2022}\footnote{\url{https://github.com/tranbahien/you-need-a-good-prior}}. 
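
As an illustration of the closed-form Gaussian distance above, the following is a minimal NumPy sketch, not the code used in our experiments. The function names \texttt{gaussian\_w2\_squared} and \texttt{\_sqrtm\_psd} are hypothetical. The sketch evaluates the cross term through the symmetric product $\Sigma_X^{1/2}\Sigma_Y\Sigma_X^{1/2}$, which has the same eigenvalues as $\Sigma_X\Sigma_Y$ and therefore the same trace of the square root, and assumes that both covariance matrices are symmetric positive semi-definite.

\begin{verbatim}
```python
import numpy as np


def _sqrtm_psd(mat):
    """Square root of a symmetric PSD matrix via eigendecomposition."""
    mat = 0.5 * (mat + mat.T)  # remove rounding asymmetry
    vals, vecs = np.linalg.eigh(mat)
    vals = np.clip(vals, 0.0, None)  # guard against tiny negative values
    return (vecs * np.sqrt(vals)) @ vecs.T


def gaussian_w2_squared(mu_x, cov_x, mu_y, cov_y):
    """Squared 2-Wasserstein distance between N(mu_x, cov_x), N(mu_y, cov_y).

    Hypothetical helper: assumes symmetric PSD covariance matrices.
    """
    mu_x = np.atleast_1d(np.asarray(mu_x, dtype=float))
    mu_y = np.atleast_1d(np.asarray(mu_y, dtype=float))
    cov_x = np.atleast_2d(np.asarray(cov_x, dtype=float))
    cov_y = np.atleast_2d(np.asarray(cov_y, dtype=float))

    root_x = _sqrtm_psd(cov_x)
    # Symmetric stand-in for (cov_x cov_y)^{1/2}; the trace is identical.
    cross = _sqrtm_psd(root_x @ cov_y @ root_x)

    mean_term = np.sum((mu_x - mu_y) ** 2)
    cov_term = np.trace(cov_x + cov_y - 2.0 * cross)
    return float(mean_term + cov_term)
```
\end{verbatim}

For diagonal covariances the cross term reduces to the element-wise product of the standard deviations, so the eigendecomposition is only needed in the general case.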


\subsection{Setup details for noisy observations}


To simulate noisy environments, random noise was added to the observed state at each time step, and the noisy state was then fed to the RL agents for training. The noise magnitudes were selected so that they significantly affect the performance of the base model (PPO). 

\begin{itemize}
    \item For \textbf{Acrobot}, we used a multivariate normal distribution for noise generation:
\[\mathcal{N}\left (
\begin{bmatrix}
0 \\
0 \\
0 \\
0 \\
0\\
0 
\end{bmatrix}, 
\begin{bmatrix}
0.5 & 0 & 0 & 0 & 0 & 0\\
0 & 0.5 & 0 & 0 & 0 & 0\\
0 & 0 & 0.5 & 0 & 0 & 0\\
0 & 0 & 0 & 0.5 & 0 & 0\\
0 & 0 & 0 & 0 & 10 & 0\\
0 & 0 & 0 & 0 & 0 & 15
\end{bmatrix} \right).\]
\item For \textbf{CartPole}, we used $\mathcal{N}\left (
\begin{bmatrix}
0 \\
0 \\
0 \\
0 
\end{bmatrix}, 
\begin{bmatrix}
1 & 0 & 0 & 0 \\
0 & 1 & 0 & 0 \\
0 & 0 & 0.1 & 0 \\
0 & 0 & 0 & 1 
\end{bmatrix} \right)$.
\item For \textbf{Hopper}, \textbf{Walker2d} and \textbf{Humanoid}, we used $\mathcal{N}\left ( \mathbf{0}, 0.1\,\mathbf{I}\right )$, where $\mathbf{I}$ is the identity matrix. 
\item For \textbf{Halfcheetah}, we used $\mathcal{N}\left ( \mathbf{0}, 2\,\mathbf{I}\right )$, where $\mathbf{I}$ is the identity matrix. 
\end{itemize}
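
The noise injection listed above can be sketched as follows. This is an illustrative NumPy snippet rather than our experimental code: the helper \texttt{add\_observation\_noise} and the constant names are hypothetical, and the sketch assumes that the matrices above are covariance matrices, so the per-dimension standard deviation is the square root of each diagonal entry.

\begin{verbatim}
```python
import numpy as np

# Diagonal covariance entries copied from the Acrobot and CartPole items.
ACROBOT_NOISE_VAR = np.array([0.5, 0.5, 0.5, 0.5, 10.0, 15.0])
CARTPOLE_NOISE_VAR = np.array([1.0, 1.0, 0.1, 1.0])


def add_observation_noise(obs, noise_var, rng):
    """Return obs plus zero-mean Gaussian noise with diagonal covariance.

    Hypothetical helper: noise_var holds the diagonal of the covariance
    matrix (an assumption), and rng is a numpy.random.Generator.
    """
    obs = np.asarray(obs, dtype=float)
    noise_var = np.asarray(noise_var, dtype=float)
    if obs.shape[-1] != noise_var.shape[0]:
        raise ValueError("observation and noise dimensions do not match")
    # Independent dimensions, so sample each with its own standard deviation.
    noise = rng.normal(loc=0.0, scale=np.sqrt(noise_var), size=obs.shape)
    return obs + noise
```
\end{verbatim}

In a training loop, such a function would be applied to the state returned by the environment at every step, before the state is passed to the agent.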


\begin{figure*}[!t]
\begin{subfigure}[t]{0.33\textwidth}
\centering
\includegraphics[width=\textwidth]{aistats/figure/PPO_msizeHopper-v2_fig_2.pdf}
\caption{Effect of measurement size}
\label{fig:msize}
\end{subfigure}
\begin{subfigure}[t]{0.33\textwidth}
\centering
\includegraphics[width=\textwidth]{aistats/figure/PPO_nbnn_Walker2d-v2_fig_2.pdf}
        \caption{Effect of BNN layer number}
         \label{fig:nbnn}
\end{subfigure}
\begin{subfigure}[t]{0.33\textwidth}
\centering
\includegraphics[width=\textwidth]{aistats/figure/PPO_nsampleHumanoid-v2_fig_2.pdf}
         \caption{Effect of function sample number}
         \label{fig:nsample}
\end{subfigure}
\caption{Parameter sensitivity analysis}
\end{figure*}



\subsection{Setup details for environment variations}


\begin{itemize}
\item For \textbf{CartPole}\footnote{\url{https://github.com/openai/gym/blob/master/gym/envs/classic_control/cartpole.py}}, we revised some of its parameters to obtain the changed environments. We first changed its transition by revising one line of \textit{step()} from  
\begin{displayquote}
temp = (force + self.polemass\_length * theta\_dot**\textbf{2} * sintheta) / self.total\_mass
\end{displayquote}
to
\begin{displayquote}
temp = (force + self.polemass\_length * theta\_dot**\textbf{4} * sintheta) / self.total\_mass
\end{displayquote}
and we also changed \textit{self.gravity = 9.8} to \textit{self.gravity = 9.8 + \textbf{x}}, where \textbf{x} was set to 5, 10, 15, 20, and 25. Such variations are expected to change the underlying dynamics. 
\item For \textbf{Hopper}, we changed its reward calculation by revising one line of \textit{step()} from  
\begin{displayquote}
 reward -= 1e-3 * np.square(a).sum()
\end{displayquote}
to
\begin{displayquote}
 reward -= x * np.square(a).sum()
\end{displayquote}
where x was set to 1e-3, 1e-2, 0.1 and 0. 
\item For \textbf{Humanoid}, we first changed its reward calculation by revising one line of \textit{step()} from  
\begin{displayquote}
 lin\_vel\_cost = 1.25 * (pos\_after - pos\_before) / self.dt   
\end{displayquote}
to
\begin{displayquote}
lin\_vel\_cost = x * (pos\_after - pos\_before) / self.dt
\end{displayquote}
and also changed 
\begin{displayquote}
quad\_ctrl\_cost = 0.1 * np.square(data.ctrl).sum()
\end{displayquote}
to
\begin{displayquote}
quad\_ctrl\_cost = y * np.square(data.ctrl).sum()
\end{displayquote}
where x was set to 3.25, 5.25, 7.25 and 9.25 and y was set to 0.01, 0.001, 0.001 and 0.001. 
\item For \textbf{Walker2d}, we first revised one line of \textit{step()} from 
\begin{displayquote}
alive\_bonus = 1.0
\end{displayquote}
to
\begin{displayquote}
alive\_bonus = x
\end{displayquote}
and also changed 
\begin{displayquote}
reward -= 1e-3 * np.square(a).sum()
\end{displayquote}
to
\begin{displayquote}
reward -= y * np.square(a).sum()
\end{displayquote}
where x was set to 1.0, 2.0, 3.0 and 4.0 and y was set to 1e-3, 1e-2, 1e-1 and 0.05. 
\item For \textbf{Halfcheetah}, we revised one line of \textit{step()} from 
\begin{displayquote}
reward\_ctrl = -0.1 * np.square(action).sum()
\end{displayquote}
to
\begin{displayquote}
reward\_ctrl = -x * np.square(action).sum()
\end{displayquote}
where x was set to 0.1, 0.15, 0.01, and 0.2. 
\end{itemize}
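
To illustrate the CartPole variation described above, the following standalone sketch reproduces the modified line of \textit{step()} as a pure function with a configurable exponent, and lists the varied gravity values. The function names \texttt{cartpole\_temp} and \texttt{varied\_gravities} are hypothetical. The default values of \texttt{polemass\_length} and \texttt{total\_mass} are assumptions taken to match the standard Gym CartPole constants. It is a sketch of the idea, not a patch of the Gym source.

\begin{verbatim}
```python
# Gravity offsets x used for the varied CartPole environments.
GRAVITY_OFFSETS = (5, 10, 15, 20, 25)


def cartpole_temp(force, theta_dot, sintheta,
                  polemass_length=0.05, total_mass=1.1, exponent=2):
    """Intermediate 'temp' term of the CartPole dynamics.

    exponent=2 is the original line; exponent=4 is the varied transition.
    The default masses are assumed, not read from the environment.
    """
    return (force + polemass_length * theta_dot ** exponent * sintheta) \
        / total_mass


def varied_gravities(base_gravity=9.8):
    """Gravity values 9.8 + x for each offset x listed above."""
    return [base_gravity + x for x in GRAVITY_OFFSETS]
```
\end{verbatim}

For a fixed force and angle, the two exponents give different values of \texttt{temp} whenever $|\dot{\theta}| \neq 1$, which is why the change alters the transition dynamics.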




\section{Parameter sensitivity analysis}

We studied the contributions of three hyperparameters: the measurement size, the number of BNN layers and the number of function samples. The effect of the measurement size (Line 5 of Algorithm 1) is shown in Figure \ref{fig:msize}, where we observed that increasing the measurement size generally improved the performance. For example, 256 and 512 were better than 32 and 64, while there was not much difference between 256 and 512. We used 64 as the default for the previous experiments. The effect of the number of BNN layers is shown in Figure \ref{fig:nbnn}, where we observed that more BNN layers took more steps to converge, so the performance of one and three layers was better than that of the other four options within 2e6 steps. Between one and three layers, the three-layer one was better. We used 3 as the default for the previous experiments. The effect of the number of function samples (Lines 7 and 8 of Algorithm 1) is shown in Figure \ref{fig:nsample}, where we observed that 10 was the best among all options, and we used it as the default for the previous experiments. 
 



\end{document}
