Bayesian Risk-Sensitive Policy Gradient For MDPs With General Loss Functions
Keywords: Bayesian Method, Reinforcement Learning, Convex RL, Policy Gradient, Markov Decision Processes
Abstract: Motivated by many application problems, we consider Markov decision processes (MDPs) with a general convex loss function and unknown parameters. To mitigate the epistemic uncertainty associated with the unknown parameters, we take a Bayesian approach to estimate the parameters from data and impose a coherent risk functional (with respect to the Bayesian posterior distribution) on the loss. Since this formulation usually does not satisfy the interchangeability principle, it does not admit Bellman equations and cannot be solved by approaches based on dynamic programming. We therefore propose a policy gradient optimization method, leveraging the dual representation of coherent risk measures and extending the envelope theorem to the continuous case. We then show that the algorithm converges to a stationary point at a rate of
$\mathcal{O}(T^{-1/2}+r^{-1/2})$, where $T$ is the number of policy gradient iterations and $r$ is the sample size of the gradient estimator. We further extend our algorithm to an episodic setting, show that the extended algorithm converges to a globally optimal policy, and provide bounds on the number of iterations needed to achieve an error bound of $\mathcal{O}(\epsilon)$ in each episode.
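To make the abstract's description concrete, the following is a minimal, illustrative sketch (not the authors' implementation) of a risk-sensitive policy-gradient step that uses the dual representation of a coherent risk measure, here CVaR at a level alpha, taken over Bayesian posterior samples of the unknown model parameters. The toy problem (a two-armed bandit with Bernoulli costs, a softmax policy, and an assumed Beta posterior), the CVaR level, and all function names are assumptions made purely for illustration; the paper itself treats general convex losses over MDPs.

```python
# Illustrative sketch only: CVaR-weighted policy gradient over posterior samples.
import numpy as np

rng = np.random.default_rng(0)


def cvar_weighted_gradient(losses, grads, alpha=0.1):
    """Combine per-posterior-sample gradients using the CVaR dual weights.

    For CVaR at level alpha, the dual-optimal distribution places weight on
    (roughly) the worst alpha-fraction of the m posterior samples, so the risk
    gradient is a weighted average of the per-sample policy gradients.
    """
    m = len(losses)
    k = max(1, int(np.ceil(alpha * m)))     # number of worst-case samples kept
    worst = np.argsort(losses)[-k:]         # indices of the largest losses
    weights = np.zeros(m)
    weights[worst] = 1.0 / k
    return np.tensordot(weights, grads, axes=1)


def rollout_loss_and_grad(theta, p, n_traj=200):
    """Score-function estimates of the expected loss and its policy gradient,
    given sampled failure probabilities p and softmax-policy logits theta."""
    probs = np.exp(theta) / np.exp(theta).sum()
    arms = rng.choice(2, size=n_traj, p=probs)
    costs = rng.binomial(1, p[arms]).astype(float)   # Bernoulli cost per pull
    score = np.eye(2)[arms] - probs                  # grad log pi(a) for softmax
    return costs.mean(), (costs[:, None] * score).mean(axis=0)


# Toy training loop: sample the (assumed Beta) posterior, estimate per-sample
# losses and gradients, then descend along the CVaR-weighted gradient.
theta = np.zeros(2)
for _ in range(100):
    post = rng.beta([2.0, 1.0], [2.0, 3.0], size=(20, 2))
    per_sample = [rollout_loss_and_grad(theta, p) for p in post]
    losses = np.array([l for l, _ in per_sample])
    grads = np.array([g for _, g in per_sample])
    theta -= 0.5 * cvar_weighted_gradient(losses, grads, alpha=0.2)
```

The key design point the sketch tries to convey is that, because the coherent risk functional is applied to the posterior distribution rather than to the trajectory distribution, the resulting objective generally lacks a Bellman decomposition, which is why a direct (dual-weighted) policy-gradient update is used instead of dynamic programming.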
Primary Area: reinforcement learning
Submission Number: 15603