\section{Introduction}
\label{sec:intro}
In safety-critical machine learning applications, accurately quantifying confidence and uncertainty in decision outcomes is imperative for regulatory and trust reasons \citep{uncertaintyinhc, disentanglinguncertainties}. The uncertainty such systems face can stem from limited data availability (epistemic) or from inherent environmental randomness (aleatoric). Uncertainty quantification is particularly relevant and challenging in reinforcement learning (RL) systems, as uncertainty in the outcomes of decisions compounds in sequential decision-making.
%The task of disentangling these uncertainties gains importance in real-world scenarios that are either discrete state in nature or arise where continuous environmental variables are clustered into a finite number of discrete states \citep{aiclinician, discretisation}. In such cases, a coarse representation of the environment can introduce aleatoric uncertainty due to information lost in compressing an environment into a finite number of states.


We use inference schemes from classical Bayesian RL to account for epistemic uncertainty, with an exact-inference Bayesian dynamics model that assigns posterior probabilities to environments \citep{bamdp}.
Aleatoric uncertainty is addressed by exploiting the analytic solutions to the linear equations for higher return distribution moments \citep{sobel}.
Our contribution on the uncertainty quantification side is combining these two ingredients to determine overall aleatoric and epistemic standard deviations.
We compare the trade-off between computation and accuracy of our method against prior work that does not exploit the tractability of finite-state MDPs.
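
As a sketch of how these two ingredients could be combined, assume the standard law of total variance; the symbols below are illustrative notation, not definitions fixed elsewhere in this paper. Let $\theta$ denote the environment dynamics, $p(\theta \mid \mathcal{D})$ the posterior given the dataset $\mathcal{D}$, and $G$ the return under a fixed policy. Then
\begin{equation}
\mathrm{Var}[G \mid \mathcal{D}]
  = \underbrace{\mathbb{E}_{\theta \sim p(\theta \mid \mathcal{D})}\big[\mathrm{Var}[G \mid \theta]\big]}_{\mathrm{aleatoric}}
  + \underbrace{\mathrm{Var}_{\theta \sim p(\theta \mid \mathcal{D})}\big[\mathbb{E}[G \mid \theta]\big]}_{\mathrm{epistemic}},
\end{equation}
and the corresponding standard deviations are the square roots of the two terms. Under this assumption, the inner variance is the per-environment return variance obtained from the linear moment equations, while the outer expectation and variance are taken over the environment posterior.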

On the control-under-uncertainty side, we propose a novel stochastic gradient-based method for policy optimisation that accounts for uncertainty in the model dynamics by optimising a policy for value under the environment posterior.
In contrast to previous methods \citep{robustmdppercentile}, we do not rely on strong assumptions about the posterior distribution.
We empirically demonstrate its performance and scalability on gridworlds and synthetic MDPs with varying offline dataset sizes, where we observe benefits in MDPs with higher uncertainty and less data.
Finally, our methods find application in clinical decision support systems (CDSS), which leverage vast patient datasets to train RL algorithms for treatment suggestions \citep{guidelinesforrl, luchen}. We analyse a setup used for sepsis treatment \citep{aiclinician}, in which patients' conditions and treatment options were clustered into finite states and actions, and which was originally tackled by applying dynamic programming methods that assume known environments \citep{bellman1957}.
The methods presented enhance this approach with uncertainty quantification and uncertainty-aware control.
We investigate the scalability of our methods in such practical environments and examine the difference in expected posterior value between our policy and the one obtained with the originally proposed policy optimisation method.
For this analysis, we include pessimism in the face of uncertainty in the form of a conservative dynamics model. Pessimism is a common and necessary ingredient in offline RL \citep{morel, mopo, diversifiedqensemble}, especially when the dataset does not adequately span the full state-action space \citep{optimistic}.