\section{Related Work}

This section reviews the treatment of uncertainty in finite-state MDPs. We focus on epistemic uncertainty in Robust and Adaptive MDP settings, aleatoric uncertainty for risk-averse policies, and recent work that quantifies both types of uncertainty in finite-state offline decision-making contexts.

\textbf{Robust and Adaptive MDPs.}
A simple model-based approach for an MDP uses relative visitation frequencies as the ground-truth transition probabilities. This can introduce bias and result in policies that generalise poorly \citep{robustmdpbias, mledynamics, cvarapproach}. To address this, uncertainty in the transition dynamics is often modelled explicitly with a Bayesian approach, as is common in Bayesian RL \citep{bayesianrlreview}. Bayesian dynamics models used in Bayes-Adaptive MDPs (BAMDPs) \citep{bamdp, dirichletbamdpprior, riskaversebamdp} maintain a belief over the transition dynamics and enable optimal `offline' planning of adaptable `online' policy rollouts. However, these models may be intractable beyond simple MDPs \citep{beetle, scalablebamdp, varibad}.
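
As a minimal illustration, with notation that is assumed here rather than taken from the cited works, let $n(s,a,s')$ denote the number of observed transitions from state $s$ to $s'$ under action $a$, and let $\mathcal{D}$ denote the dataset. The frequency-based (maximum-likelihood) estimate is
\[
  \hat{P}(s' \mid s,a) = \frac{n(s,a,s')}{\sum_{s''} n(s,a,s'')},
\]
whereas, assuming an independent Dirichlet prior with concentration parameters $\alpha(s,a,\cdot)$ on each row of the transition matrix (one common choice among several), the posterior is
\[
  P(\cdot \mid s,a) \mid \mathcal{D} \sim \mathrm{Dir}\bigl(\alpha(s,a,\cdot) + n(s,a,\cdot)\bigr).
\]
The posterior retains uncertainty for rarely visited state-action pairs, which is where the point estimate is least reliable.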

In high-risk offline settings, exploration is undesirable. For instance, in the clinical decision support system (CDSS) developed by \citet{aiclinician}, novel actions are avoided by only considering actions above a minimum visitation threshold. We therefore focus on optimal \textit{memoryless}, stationary (non-adaptive) policies that depend only on the state \citep{robustmdppercentile}. Policies that are robust to the worst-case realisation of uncertain dynamics are often overly conservative, which makes optimising the average value across a distribution of MDPs a better alternative \citep{robustadversarial, robustmdp, robusttradeoff}, and a principled one in line with Bayesian decision theory \citep{bayesianchoice}. This is the problem formulation we adopt here. For our methods to be applicable to real-world tasks, we additionally require that they scale to medium-sized MDPs (approximately $10^3$ states) in data regimes where significant uncertainty is still present. \citet{robustmdppercentile} proposes a gradient-based method to optimise this objective, but makes the strong assumption that higher moments of the posterior distribution of transition parameters are small. \citet{robustvalueiter} proposes an algorithm that yields provably near-Bayes-optimal stationary policies, but focuses on bounding the expected utility relative to the Bayes-optimal \emph{adaptive} policy, which in general has a different utility from the optimal offline stationary policy.
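
The two objectives contrasted above can be sketched as follows, under notation assumed here for illustration only. Let $V^{\pi}_{P}$ denote the expected discounted return of a stationary policy $\pi$ under transition model $P$ from a fixed initial-state distribution, let $\mathcal{P}$ be an ambiguity set of plausible models, and let $p(P \mid \mathcal{D})$ be the posterior over models given data $\mathcal{D}$. The worst-case (robust) objective is
\[
  \max_{\pi} \; \min_{P \in \mathcal{P}} \; V^{\pi}_{P},
\]
whereas the average-value objective is
\[
  \max_{\pi} \; \mathbb{E}_{P \sim p(P \mid \mathcal{D})} \bigl[ V^{\pi}_{P} \bigr].
\]
The first guards against a single adversarial model in $\mathcal{P}$, while the second weights every model by its posterior plausibility.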

\textbf{Risk-averse policies.}
Accounting for inherent environmental stochasticity is often desirable. Within the distributional RL framework \citep{bellemare2017}, policies are often informed by properties of the return distribution other than its mean in order to select risk-averse actions \citep{dabney2018quantiles, uadqn}. However, optimal policies for such statistical functionals are generally neither memoryless nor time-consistent \citep{sobel, distributionalrlbook}. Therefore, we focus on using the mean of the return distribution to guide the agent's policy.

\textbf{Aleatoric and Epistemic Uncertainty in Finite-state MDPs.}
Several recent efforts have tried to model both types of uncertainty in discrete environments.
In a healthcare context, \citet{learningtodefer} used a Bayesian dynamics model and Monte Carlo trajectory sampling to estimate uncertainties and determine when to defer treatment. In contrast, \citet{paul} trained an ensemble of distributional deep neural networks (DNNs) to learn the return distribution, effectively learning a `distribution over return distributions'. Neither of these works exploits the benefits of a stationary MDP. We therefore complement them by directly exploiting the tractability advantages of finite-state MDPs, such as closed-form moments of the return distribution and the possibility of exact inference on the environment dynamics, with methods that yield computationally efficient and accurate uncertainty representations.
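
To make the closed-form moments concrete, consider the following sketch under simplifying assumptions that are ours: a fixed stationary policy $\pi$ with state-to-state transition matrix $P_{\pi}$, a deterministic state-dependent reward vector $r_{\pi}$, and a discount factor $\gamma \in [0,1)$. The mean of the return distribution then solves a linear system,
\[
  V = (I - \gamma P_{\pi})^{-1} r_{\pi},
\]
and the second moment $M$ solves another,
\[
  M = (I - \gamma^{2} P_{\pi})^{-1} \bigl( r_{\pi} \circ r_{\pi} + 2 \gamma \, r_{\pi} \circ (P_{\pi} V) \bigr),
\]
where $\circ$ denotes the elementwise product. The per-state variance of the return is $M - V \circ V$. Both quantities require only linear solves in the number of states, with no trajectory sampling.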