\appendix
\onecolumn



\section{Probabilistic evaluation bounds}
\label{app:numsamples}

Here we provide a quantitative investigation into the choice of method to evaluate the quantities of interest for a given policy, including a comparison of the probabilistic bounds on the errors due to finite numbers of samples.
We compare the number of samples required to achieve an evaluation within a given accuracy $\epsilon$ with probability at least $1-\delta$ for methods that (i) carry out Monte Carlo sampling for every evaluation step and (ii) (ours, Algorithm \ref{alg:uncertainty}) carry out an exact calculation of the return distribution moments, followed by Monte Carlo evaluation with samples from the dynamics posterior.
The quantity we investigate in detail is the Bayesian value at a given state for a given policy (appearing in Eq. \ref{eq:objective}). Since aleatoric and epistemic uncertainties are calculated in a very similar fashion, the conclusions regarding
Bayesian value estimation also carry through to the uncertainty quantification case.

\subsection{Exact moments}

The quantity of interest we wish to approximate is 
\begin{equation}
    \hat{V} = \mathbb{E}_\mathcal{M}(V_\mathcal{M}(s)),
\end{equation}
where the expectation is taken over the Dirichlet posterior of MDP dynamics parameters. For a given set of dynamics parameters $\mathcal{M}$, we have access to the closed form expression for the first moment of the return distribution $V_\mathcal{M}(s)$ (in terms of policy, dynamics and reward) as presented in Eq. \ref{eq:bellmansolution}.

We assume a bounded reward $|r|\leq r_\text{max}$ and employ the well-known form of the Hoeffding inequality \cite{hoeffding} valid for the random variable $S_n = \sum_{i=1}^{n}{X_i}$ with $X_i$ bounded and i.i.d. such that $\mathbb{E}(S_n)=\mu$:
\begin{equation}
    \mathbb{P}(|S_n - \mu| \leq \epsilon) \geq 1 - 2\exp{\left(-\frac{2\epsilon^2}{n \Delta^2}\right)}
\end{equation}
with $\Delta$ being the size of the interval on which each $X_i$ can take values.

In context, we take $X_i = \frac{1}{N_M} V_i$ as the closed-form expression for the value of the $i^\text{th}$ of the $N_M$ dynamics samples, so $\mu = \hat{V}$.
From the boundedness assumption on the reward, we can also bound $|V_i|\leq \frac{r_\text{max}}{1-\gamma} = V_\text{max}$ and $\Delta \leq 2 V_\text{max}/N_M$.
We require enough samples so that, with probability at least $1-\delta$, our approximation of $\hat{V}$ is within $\epsilon$ of the true value.
By the Hoeffding inequality, we can ensure this is the case by choosing $N_M$ such that
\begin{equation}
    \delta \geq 2\exp{\left(-\frac{N_M\epsilon^2}{2V_\text{max}^2}\right)},
\end{equation}
which corresponds to the smallest integer $N_M$ such that
\begin{equation}
    N_M \geq \log\left(\frac{2}{\delta}\right)\left(\frac{2 V_\text{max}^2}{\epsilon^2}\right).
\end{equation}
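
For completeness, the exponent used above can be obtained by substituting $n=N_M$ and $\Delta = 2V_\text{max}/N_M$ into the Hoeffding bound, and the sample requirement follows by asking that the failure probability does not exceed $\delta$. A short sketch of these two steps, using only the quantities already defined in this section:
\begin{align}
    \frac{2\epsilon^2}{n\Delta^2} &= \frac{2\epsilon^2 N_M^2}{4 N_M V_\text{max}^2} = \frac{N_M \epsilon^2}{2V_\text{max}^2}, \\
    2\exp{\left(-\frac{N_M\epsilon^2}{2V_\text{max}^2}\right)} \leq \delta \quad &\Longleftrightarrow \quad N_M \geq \frac{2V_\text{max}^2}{\epsilon^2}\log\left(\frac{2}{\delta}\right).
\end{align}
The required number of dynamics samples therefore grows quadratically in $V_\text{max}/\epsilon$ but only logarithmically in $1/\delta$.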

\subsection{Monte-Carlo sampling}

The alternative to using closed-form expressions for the moments of the return distribution given an MDP sample is to approximate these moments in turn through Monte Carlo samples, as done in \cite{learningtodefer}.
To do so, given the infinite-horizon nature of the MDPs we are considering, we would have to accumulate rewards over a roll-out with a finite number of steps $T$, thus incurring a truncation error, which can be bounded above by $\gamma^T V_\text{max}$. Note that the tightness of this bound depends entirely on the reward structure of the MDP, and that this source of error cannot be reduced by repeatedly sampling transitions.
For the purposes of the analysis presented, we will be generous in mostly ignoring the computational cost associated with sampling trajectories for a given MDP.
In practice, sampling from a categorical distribution (i.e. sampling the trajectories for a given MDP) is significantly faster than sampling from a Dirichlet distribution (i.e. sampling the transition matrix), so we incorporate the overall computational cost of trajectory sampling into the modest condition that $T$ cannot be arbitrarily large, but assume infinite trajectory sampling capability otherwise.
This assumption allows us to determine the value for the $i^\text{th}$ sampled MDP arbitrarily accurately up to this error, so that the distance between the true value $V_i$ and the accumulated finite sum of rewards $V_i'$ is bounded by $|V_i - V'_i|\leq \gamma^T V_\text{max}$.
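
The truncation bound follows from the bounded-reward assumption alone. As a brief sketch, write $r_t$ for the reward received at step $t$ of a roll-out (notation introduced here only for illustration) and take $V'_i$ to be the expectation of the truncated discounted sum over trajectories of the $i^\text{th}$ sampled MDP, as permitted by the unlimited-trajectory assumption:
\begin{equation}
    |V_i - V'_i| = \left|\mathbb{E}\left(\sum_{t=T}^{\infty}\gamma^t r_t\right)\right| \leq r_\text{max}\sum_{t=T}^{\infty}\gamma^t = \frac{\gamma^T r_\text{max}}{1-\gamma} = \gamma^T V_\text{max}.
\end{equation}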

Thus, we can consider the distance
\begin{align}
    \left|\hat{V} - \frac{1}{N_M}\sum_i V'_i\right| &\leq
    \left|\hat{V} - \frac{1}{N_M}\sum_i V_i\right| + \left|\frac{1}{N_M}\sum_i V_i - \frac{1}{N_M}\sum_i V'_i\right| \\
    & \leq \left|\hat{V} - \frac{1}{N_M}\sum_i V_i\right| + \gamma^T V_\text{max},
\end{align}
so that if 
\begin{equation}
    \left|\hat{V} - \frac{1}{N_M}\sum_i V_i\right| + \gamma^T V_\text{max} \leq \epsilon,
\end{equation}
with probability at least $1-\delta$, then the distance to the original estimate also satisfies
\begin{equation}
    \left|\hat{V} - \frac{1}{N_M}\sum_i V'_i\right| \leq \epsilon
\end{equation}
with probability at least $1-\delta$.

As such, we apply the Hoeffding inequality in the form
\begin{equation}
    \mathbb{P}\left(\left|\hat{V} - \frac{1}{N_M}\sum_i V_i\right| \leq \epsilon - \gamma^T V_\text{max}\right) \geq 1 - 2\exp{\left(-\frac{N_M(\epsilon - \gamma^T V_\text{max})^2}{2V_\text{max}^2}\right)}.
\end{equation}
Note that this also imposes a minimum horizon truncation of $T>\log(\epsilon/V_\text{max})/\log\gamma$. Explicitly including the probability threshold $\delta$ now corresponds to finding an $N_M$ such that
\begin{equation}
    \delta \geq 2\exp{\left(-\frac{N_M(\epsilon - \gamma^T V_\text{max})^2}{2V_\text{max}^2}\right)},
\end{equation}
so
\begin{equation}
    N_M \geq \log\left(\frac{2}{\delta}\right)\frac{2 V_\text{max}^2}{(\epsilon - \gamma^T V_\text{max})^2}.
\end{equation}

This bound corresponds to a worsening by a factor of $(1-\gamma^T V_\text{max}/\epsilon)^{-2}$ in the number of samples required to reach comparable accuracy to the method that uses exact moments.
For example, the gridworld setup considered ($\gamma=0.999$, $r_\text{max}=1$, and positing $\epsilon=0.001$) gives $V_\text{max}=10^3$ and hence a minimum horizon of $T>\log(10^{-6})/\log(0.999)\approx 1.4\times 10^4$ steps for every rolled-out trajectory (of which we are assuming an arbitrarily large number can be carried out to obtain this bound). At that point the contribution of trajectory sampling to the bottleneck would be severe, and a completely different bound would be required to take it into account.
Thus, for the regime we consider, computing exact moments reduces the number of samples that must be drawn from the Dirichlet posterior, which is the computational bottleneck.

Note that aleatoric and epistemic uncertainty behave similarly: aleatoric variance is an analogous expectation over the second moment instead of the first (which we can again obtain in closed form or estimate through Monte Carlo samples), and the bound is analogous. Similarly, for epistemic variance the error in return due to truncated trajectories compounds when calculating the variance over expected returns, and we again expect a correspondingly larger $N_M$ to be required.

\section{Stochastic Optimal Policy}
\label{app:casino}
Here we provide an illustrative example of how the Bayesian objective for expected value (Eq.~\ref{eq:objective}), where MDPs are sampled from some distribution, may not admit a deterministic optimal policy.

Consider the following `casino' MDP with three states. State $s$ corresponds to the player being in the casino, where the possible actions are to play or leave. Being in state $s$ costs the player 1 unit of currency every time that state $s$ is visited. The outcome of leaving is to deterministically transition to a terminal state with no further rewards. On the other hand, playing has a stochastic outcome, with a probability $\theta$ of losing, in which case the player remains in state $s$, and a probability $1-\theta$ of winning, in which case the player transitions to state $w$, where they receive a payout of $R$ units and then deterministically transition to the terminal state with no further rewards.
Thus, each realisation of $\theta$ corresponds to a slightly different MDP with different probabilities of winning and therefore different optimal policies.

For a policy that plays with probability $\pi$ and leaves with probability $1-\pi$, the Bellman equation for the value $V$ of state $s$ under this policy is
\begin{equation}
    V = -1 + \gamma(\pi \theta V + \pi (1-\theta) R).
\end{equation}
Solving for $V$ gives the value starting from state $s$ for a specific MDP:
\begin{equation}
    V = \frac{-1+\gamma \pi (1-\theta) R}{1-\gamma \pi \theta}.
\end{equation}
We now consider the expected value when $\theta$ is sampled from some distribution. For example, if $\theta$ is sampled from a Bernoulli distribution with parameter $p=\frac{1}{2}$, the expected value of the policy $\pi$ over this distribution of MDPs is
\begin{equation}
    V = \frac{1}{2}\left(-1+\gamma \pi R -\frac{1}{1-\gamma\pi}\right).
\end{equation}

We visualise this value as a function of $\pi$ in Fig.~\ref{fig:valuegraph} for $R=10$, $\gamma=0.99$. The maximum value is not achieved at $\pi=0$ or $\pi=1$, but rather at $\pi \approx 0.69$. A policy that never plays achieves a value of $-1$, a policy that always plays achieves a value of $-45.55$, and the optimal (stochastic) policy achieves a value of about $1.34$.
Thus, we have an example where there is no deterministic optimal policy. 
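
The location of the optimum can also be checked analytically. A short sketch, differentiating the expected value above with respect to $\pi$ (the symbol $\pi^*$ for the maximiser is introduced here for illustration):
\begin{align}
    \frac{dV}{d\pi} &= \frac{\gamma}{2}\left(R - \frac{1}{(1-\gamma\pi)^2}\right) = 0 \quad\Longrightarrow\quad \pi^* = \frac{1}{\gamma}\left(1-\frac{1}{\sqrt{R}}\right), \\
    V(\pi^*) &= \frac{1}{2}\left(R - 2\sqrt{R} - 1\right).
\end{align}
For $R=10$ and $\gamma=0.99$ this gives $\pi^*\approx 0.69$ and $V(\pi^*)\approx 1.34$. Since $d^2V/d\pi^2 = -\gamma^2/(1-\gamma\pi)^3 < 0$ on $[0,1]$, the expected value is strictly concave in $\pi$, so this stationary point is the unique maximiser whenever it lies in the interior, that is, when $R>1$ and $1-1/\sqrt{R}<\gamma$.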
\begin{figure}
    \centering
    \includegraphics[width=0.7\textwidth]{figures/valuegraph.pdf}
    \caption{Plot of average value across a distribution of casino MDPs as a function of policy.}
    \label{fig:valuegraph}
\end{figure}

\section{Computational scalability}
\label{app:rebuttal_comp}

We present in Fig.~\ref{fig:rebuttal_comp} empirical results on scaling our method to larger state-spaces.
The experiment we benchmark is the one on synthetic MDPs, as carried out in section \ref{sec:syntheticmdps} but with varying state-space sizes for two different posterior sample batch sizes.
We consider both the case where the posterior is resampled at every gradient step (as in the synthetic MDP experiments) and the case where the posterior is only sampled once at the start of training.
While the latter is not suggested in practice, the resulting computation time bounds the compute time that can be saved by resampling periodically between gradient steps rather than at every step during training.

\begin{figure}
    \centering
    \includegraphics[width=0.7\columnwidth]{rebuttal_figures/computefig.pdf}
    \caption{Computation time required to run Algorithm \ref{alg:improvement} for 1000 gradient steps. We display results for batches of sizes 8 and 256, for the case where the posterior is sampled before each gradient step (resampling) or only once at the start of training (no resampling).}
    \label{fig:rebuttal_comp}
\end{figure}

\section{Policy uncertainty evaluation}
\label{app:paul}

The policy we present and compare results for is the one that is optimal for the maximum likelihood estimate (MLE) of the MDP transition dynamics, where each transition probability is taken to be the relative frequency of the observed transitions. We refer to this as the MLE-optimal policy.

Running SARSA policy evaluation with the methods proposed in \cite{paul} shows explicitly that the epistemic uncertainty in the transition dynamics is not captured by the ensemble method used. Fig.~\ref{fig:paul} shows that with this setup, epistemic uncertainty correlates with loss but is independent of the amount of data observed. This is visible as the curves collapse to small epistemic uncertainty values irrespective of dataset size, even though the smallest dataset (25 transitions) contains fewer transitions than the total number of transitions of the MDP (80). This is because the ensemble captures parametric training uncertainty but not uncertainty in the dynamics model.

\begin{figure}
    \centering
    \includegraphics[width=\textwidth]{figures/loss_and_epistemic.pdf}
    \caption{Plot of the epistemic uncertainty and loss as a function of training timestep, demonstrating that epistemic uncertainty is not accurately tracked by previous methods. Epistemic standard deviation (top row, red data) is quantified here over 10k time steps, corresponding to the agent carrying out transitions over many episodes.
    The corresponding ensemble quantile regression loss (bottom row, blue data) at each training timestep is shown below.
Here we show as an exemplar the results for a fixed policy using ensemble methods with an MLE-dynamics model for different numbers of observed transitions in the dataset generated by the gridworld with $p_\text{rand}=0.5$. The value that the epistemic standard deviation converges to is always small for all visited states and independent of dataset size, as the only notion of uncertainty captured in this setup is parametric uncertainty and not MDP uncertainty.}
    \label{fig:paul}
\end{figure}

\section{Multi Sample Backward Induction}
\label{app:msbi}
We apply here a variant of the MSBI method presented in \cite{robustvalueiter}, which outputs a policy that is near-optimal with respect to the Bayes-adaptive optimal policy under some assumptions. The main assumption is that the belief change between timesteps is bounded, and the authors show that a greedy backwards induction algorithm can yield a policy whose Bayesian utility lower bounds the adaptive Bayes-optimal utility to within an error term proportional to this bound.
This optimal adaptive utility will, however, in general differ from the posterior value expectation that we are aiming to maximise with our policy, so the algorithm's purpose is not entirely aligned with ours.
Nonetheless, in practice the proposed algorithm amounts to an analogue of value iteration in which, at each timestep, the iterated value is obtained by also marginalising over MDPs.
This can be implemented and tested in our setting, so we provide the relevant results here for completeness, although we emphasise that the theoretical foundations and guarantees of near-optimality do not apply to our specific case.
For a fairer comparison to the gradient-based methods, we also use the nominal policy as the starting policy in this algorithm.

\paragraph{Gridworld}
The work suggests a number of posterior samples of the order of $\left(\epsilon(1-\gamma)\right)^{-3}\approx10^{15}$ using $\gamma=0.999$ and an error tolerance on the value of $\epsilon=0.01$, which is a computationally intractable number of samples to store and process for transition matrices.
Thus, we use a number of samples ($N_M=32768$) and maximum number of iterations $(2000)$ that roughly match the computation time of the gradient-optimised policy (30-60s depending on dataset without GPU acceleration for the gridworld experiments).
In Fig.~\ref{fig:gridmsbi} we show the performance of MSBI on the relative posterior expected value objective for the same gridworld as in section \ref{sec:gridworld} over 50 runs, where regions above the red line correspond to improved posterior value maximisation with our algorithm.
As may be expected from the gap between the algorithm's original intention and our application, MSBI consistently underperforms with the exception of high data regimes, where the probability mass collapses on one MDP and the algorithm essentially reduces to value iteration.

\paragraph{Synthetic MDPs}
We also apply MSBI to synthetic MDPs as presented in section \ref{sec:syntheticmdps} and report results in Figs.~\ref{fig:syntheticmsbi} and \ref{fig:syntheticmsbibayes}. Once again, due to the large number of experiments run (250 runs for each of the 10 different dataset sizes), we had to reduce the number of posterior samples to $N_M=2048$ and fix the maximum number of iterations to $500$.

\begin{figure}
    \centering
    \begin{subfigure}[b]{0.3\textwidth}
        \raisebox{0.38cm}{\includegraphics[width=\columnwidth]{figures/gradmsbicomp.pdf}}
        \caption{Gradient vs MSBI gridworld posterior expected value}
        \label{fig:gridmsbi}
    \end{subfigure}
    \begin{subfigure}[b]{0.3\textwidth}
        \includegraphics[width=\columnwidth]{synthetic_figures/msbi_ground_truth.pdf}
        \caption{Gradient vs MSBI synthetic MDP ground truth performance (shaded standard error of the mean)}
        \label{fig:syntheticmsbi}
    \end{subfigure}
    \begin{subfigure}[b]{0.3\textwidth}
        \raisebox{0.38cm}{\includegraphics[width=\columnwidth]{synthetic_figures/msbi_bayes.pdf}}
        \caption{Gradient vs MSBI synthetic MDP posterior expected value}
        \label{fig:syntheticmsbibayes}
    \end{subfigure}
    \caption{Relative performance of MSBI on gridworld posterior expected value objective over 50 runs (Fig.~\ref{fig:gridmsbi}) and synthetic MDP ground truth performance (Fig. \ref{fig:syntheticmsbi}, with shaded standard error of the mean) and posterior value objective (Fig.~\ref{fig:syntheticmsbibayes}) over 250 runs.}
    \label{fig:msbi}
\end{figure}

\section{Synthetic MDPs relative performance}
\label{app:rebuttal_perf}

We display in Figs.~\ref{fig:rebuttal_perf1} and~\ref{fig:rebuttal_perf2} results corresponding to those presented in Figs.~\ref{fig:syntheticgroundtruth} and~\ref{fig:syntheticbayes} but with the $y$-axis scaled by the value achieved by our method (resulting in a different scaling value for each dataset size), so that the resulting plot can be interpreted as a fractional relative improvement.

\begin{figure*}
    \centering
    \begin{subfigure}[b]{0.3\textwidth}
        \includegraphics[width=\columnwidth]{rebuttal_figures/3a.pdf}
        \caption{Gradient vs MLE-optimal}
    \end{subfigure}
     \begin{subfigure}[b]{0.3\textwidth}
        \includegraphics[width=\columnwidth]{rebuttal_figures/3b.pdf}
        \caption{Gradient vs Nominal}
    \end{subfigure}
    \begin{subfigure}[b]{0.3\textwidth}
        \includegraphics[width=\columnwidth]{rebuttal_figures/3c.pdf}
        \caption{Gradient vs Second order}
    \end{subfigure}
    \caption{Ground truth pairwise difference in average performance (and shaded standard error of the mean) normalised by average performance of the Gradient (ours) method for each dataset. Regions above the red line correspond to improved performance with our method.}
    \label{fig:rebuttal_perf1}
\end{figure*}

\begin{figure*}
    \centering
    \begin{subfigure}[b]{0.3\textwidth}
        \includegraphics[width=\columnwidth]{rebuttal_figures/4a.pdf}
        \caption{Gradient vs MLE-optimal}
    \end{subfigure}
     \begin{subfigure}[b]{0.3\textwidth}
        \includegraphics[width=\columnwidth]{rebuttal_figures/4b.pdf}
        \caption{Gradient vs Nominal}
    \end{subfigure}
    \begin{subfigure}[b]{0.3\textwidth}
        \includegraphics[width=\columnwidth]{rebuttal_figures/4c.pdf}
        \caption{Gradient vs Second order}
    \end{subfigure}
    \caption{Average and standard deviation (shaded) of posterior expected value normalised by average performance of the Gradient (ours) method for each dataset. Regions above the red line correspond to improved performance with our method.}
    \label{fig:rebuttal_perf2}
\end{figure*}


\section{Clinical data discussion}
\label{app:clinical}
\subsection{General discussion}

Bayesian inference with Dirichlet distributions with a large number of possible outcomes (next states) is problematic, as mentioned in section \ref{sec:bayesdyanmics} \citep{dirichlethierarchical}, and careful thought must be given to what prior to employ.
First we consider a Bayesian model selection approach: we assume all possible states are reachable and symmetric.
This allows us to optimise the model evidence with respect to the unique parameter $\alpha_p$ of the prior, in the hope that specifying a prior which is more in line with the observations will lead to better inference (see Appendix \ref{sec:modelselection} for details).
As expected, the optimal $\alpha_p$ is found to be much smaller than 1, $\alpha_p=0.072$, giving less weight after inference to the prior than the maximum-entropy $\alpha_p=1$ prior does.
However, this approach still fails to accurately model our belief, which can be seen by considering the following scenario: suppose the patient is in a bad state and has two options, namely (a) try a treatment that has been attempted many times with rare success or (b) try a treatment that has always gone wrong, but has been tried a small number of times so has high uncertainty in the outcome.
Option (b) is clearly not appealing, but the agent's posterior will still place significant probability mass on unobserved states in the presence of a small number of transitions, thus highly encouraging the agent to take the less visited action and assigning it a disproportionately high value.
Upon inspection, this is exactly what is happening in the outlier state in Fig.~\ref{fig:statevalues}a (at approximate coordinates $(0.6,0.8)$), and the value given by this Bayesian posterior is likely unreasonable.
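
The effect described above can be made concrete with the posterior mean of a symmetric Dirichlet distribution. As a rough sketch, suppose that for a given state-action $N_{s,a}$ transitions have been observed, reaching $k$ distinct next states out of $|\mathcal{S}|$ possible ones (the symbol $k$ is introduced here for illustration). The total posterior mean probability assigned to the remaining $|\mathcal{S}|-k$ unobserved next states is then
\begin{equation}
    \frac{(|\mathcal{S}|-k)\,\alpha_p}{|\mathcal{S}|\alpha_p + N_{s,a}},
\end{equation}
which remains appreciable whenever $N_{s,a}$ is not large compared to $|\mathcal{S}|\alpha_p$, even for $\alpha_p$ well below $1$.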

To address this, we introduce conservatism by considering only observed states and the death state as next possible states, thus ensuring a more conservative prior. 
Inducing conservatism in offline RL with datasets that do not adequately cover the full state-action space is in line with literature \citep{optimistic, bear}, and conservative MDP models have found success in continuous offline RL by modulating reward \citep{mopo, morel} or dynamics \citep{pessimisticmodel}, somewhat analogously to what is being proposed here.
By only including observed or negative outcomes, the agent is unable to place probability mass on unsupported next-states, and therefore cannot use high uncertainty to inflate the value of poorly visited actions in bad states.
The scarcity of outcomes allows for meaningful inference using a maximum-entropy prior with $\alpha_p=1$, and a high-entropy prior is favorable from a conservatism standpoint. It encourages the agent to select actions that have sufficient support to offset the high prior probability mass assigned to the death state.
The Bayesian values inferred with this setup are presented in Fig.~\ref{fig:statevalues}b.
Fig.~\ref{fig:statevalues} shows the possible improvement, according to the Bayesian posterior value, of employing the Bayesian gradient-optimised policy compared to the MLE-optimal policy used in \cite{aiclinician}, resulting in higher probability of survival (according to the dynamics model).
In particular, we note that employing the gradient-optimised policy improves the value, and therefore corresponding approximate probability of survival, by about $2.1\%$ when averaged across states, with a maximum improvement on a particular state of $17.8\%$, according to the conservative Bayesian dynamics model.
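
To illustrate how data support offsets the prior mass on the death state under this conservative prior, consider a hypothetical state-action with $N_{s,a}$ observed transitions to $k$ distinct next states, none of which is the death state (the symbol $k$ is notation introduced here for illustration). With $\alpha_p=1$ over the $k+1$ allowed outcomes, the posterior mean probability of transitioning to the death state is
\begin{equation}
    \frac{\alpha_p}{(k+1)\alpha_p + N_{s,a}} = \frac{1}{k+1+N_{s,a}},
\end{equation}
so a rarely tried action retains a substantial posterior probability of death, which shrinks only as more transitions are observed.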

In Fig.~\ref{fig:uncertaintyvis} we show how the aleatoric and epistemic uncertainties of the MIMIC-III states are related. The values are computed using the same conservative dynamics model as in Fig.~\ref{fig:statevalues}b.

\begin{figure}
    \centering
     \includegraphics[width=0.5\textwidth]{figures/uncertainty_grid.pdf}
     \caption{States plotted according to their epistemic and aleatoric standard deviations. Each dot represents a state, with its colour corresponding to its average value according to the Bayesian posterior.}
     \label{fig:uncertaintyvis}
\end{figure}

As expected for the particular reward structure of the MDP considered, aleatoric uncertainty and average Bayesian value are strongly related: since the return variable is approximately Bernoulli (approximately $1$ for success and $0$ for failure), its mean and variance are related straightforwardly. Note that this will not be true for MDPs with more general reward structures.
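
This relation can be sketched explicitly under the idealisation that the return $G$ is exactly $0$ or $1$ for every sampled MDP. The symbols $G$, $\sigma^2_\text{al}$ and $\sigma^2_\text{ep}$ are introduced here for illustration, with aleatoric variance taken as the posterior expectation of the return variance and epistemic variance as the posterior variance of the value:
\begin{align}
    \mathrm{Var}_\mathcal{M}(G) &= V_\mathcal{M}(s)\left(1 - V_\mathcal{M}(s)\right), \\
    \sigma^2_\text{al} = \mathbb{E}_\mathcal{M}\left(\mathrm{Var}_\mathcal{M}(G)\right) &= \hat{V}(1-\hat{V}) - \sigma^2_\text{ep}.
\end{align}
Under this idealisation the total variance $\sigma^2_\text{al}+\sigma^2_\text{ep}$ is fixed by the average Bayesian value $\hat{V}$, so states with small epistemic variance lie close to the curve $\sigma^2_\text{al}=\hat{V}(1-\hat{V})$.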

\subsection{Bayesian model selection}
\label{sec:modelselection}

To determine the prior for the dynamics model with results presented in Fig.~\ref{fig:statevalues}a, we carry out Bayesian model selection by minimising the negative log-marginal likelihood of the data with respect to the parameter $\alpha_p$.
To remain consistent with the limitation that only actions observed at least 5 times in the data should be employed at each state, we only use the data for such state-action transitions when determining the optimal $\alpha_p$.

For each state-action, the full form of the Dirichlet prior in terms of $\alpha_p$ is \cite{dirichlethierarchical}
\begin{equation}
    p(\{\theta_{s,a}^{s_j}|s_j\in\mathcal{S}\}) = \frac{\Gamma(|\mathcal{S}|\alpha_p)}{\Gamma(\alpha_p)^{|\mathcal{S}|}}\prod_j (\theta_{s,a}^{s_j})^{\alpha_p-1},
\end{equation}
where $\Gamma$ is the gamma function. The likelihood is
\begin{equation}
    p(\mathcal{D}|\theta) = \prod_j (\theta_{s,a}^{s_j})^{n_j},
\end{equation}
with $n_j$ being the number of observed transitions from state-action $s,a$  to state $s_j$.
Hence, the model evidence is
\begin{align}
    p(\mathcal{D}) &= \int d\theta p(\mathcal{D}|\theta)p(\theta) \\
    &= \frac{\Gamma(|\mathcal{S}|\alpha_p)}{\Gamma(\alpha_p)^{|\mathcal{S}|}} \frac{\prod_{j}\Gamma(\alpha_p+n_j)}{\Gamma(|\mathcal{S}|\alpha_p + N_{s,a})},
\end{align}

with $N_{s,a}$ being the number of observed transitions from state-action $s,a$. Since transitions are independent across state-actions, taking the negative logarithm of this quantity and summing across all state-actions results in the overall negative log-marginal likelihood for the dataset in terms of $\alpha_p$.
The resulting function of $\alpha_p$ is visualised in Fig.~\ref{fig:modelselection} and attains a minimum value at approximately $\alpha_p=0.072$.
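
Written out explicitly, the objective described above takes the following form. This is a sketch that assumes the standard Dirichlet-multinomial evidence for each state-action, with $n_j^{s,a}$ denoting the count $n_j$ for state-action $(s,a)$ and $\mathcal{L}$ a symbol introduced here for illustration:
\begin{equation}
    \mathcal{L}(\alpha_p) = -\sum_{(s,a)}\left[\log\Gamma(|\mathcal{S}|\alpha_p) - \log\Gamma(|\mathcal{S}|\alpha_p + N_{s,a}) + \sum_{j:\,n_j^{s,a}>0}\left(\log\Gamma(\alpha_p + n_j^{s,a}) - \log\Gamma(\alpha_p)\right)\right].
\end{equation}
Next states with $n_j^{s,a}=0$ contribute $\log\Gamma(\alpha_p)-\log\Gamma(\alpha_p)=0$, so only observed transitions enter the inner sum. In practice the expression can be evaluated with a log-gamma routine to avoid numerical overflow.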

\begin{figure}
    \centering
     \includegraphics[width=0.5\textwidth]{figures/model_selection.pdf}
     \caption{Negative log-marginal likelihood for clinical data dynamics model against parameter $\alpha_p$ of the prior.}
     \label{fig:modelselection}
\end{figure}

