Abstract: Robust Markov decision processes (MDPs) tackle changing or partially known system dynamics. To solve them, one typically resorts to robust optimization, which can significantly increase computational complexity and limit scalability. On the other hand, policy regularization improves learning stability without impairing time complexity. Yet it does not account for uncertainty in the model dynamics. In this work, we aim to learn robust MDPs using regularization. We first show that policy regularization methods solve a particular instance of robust MDPs with uncertain rewards. We then extend this relationship to MDPs with uncertain transitions: this leads to a regularization term with an additional dependence on the value function. We thus introduce twice regularized MDPs ($\text{R}^2$ MDPs), i.e., MDPs with value *and* policy regularization. The corresponding Bellman operators lead to planning and learning schemes with convergence and generalization guarantees, thereby reducing robustness to regularization. We numerically demonstrate this twofold advantage on tabular and physical domains, and illustrate the persistent efficacy of $\text{R}^2$ regularization.
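To make the idea of value *and* policy regularization concrete, below is a minimal tabular sketch of a Bellman-style update that combines a policy regularizer (negative entropy, giving a softmax backup) with a value regularizer (assumed here to be an L2-norm penalty on the value estimate). The specific regularizers, their weights, and the update form are illustrative assumptions, not the paper's exact $\text{R}^2$ Bellman operator.

```python
# Illustrative sketch only: a tabular Bellman-style update combining a policy
# regularizer (negative entropy) and a value regularizer (assumed L2 penalty),
# in the spirit of value-and-policy ("twice") regularization. Not the paper's
# exact R^2 Bellman operator.
import numpy as np

def twice_regularized_update(v, P, r, gamma=0.9, tau=0.1, kappa=0.01):
    """One synchronous update of a regularized Bellman-style operator.

    v     : (S,) current value estimate
    P     : (S, A, S) transition kernel
    r     : (S, A) reward function
    tau   : weight of the policy (entropy) regularizer -> softmax backup
    kappa : weight of the value regularizer (assumed L2 penalty on v)
    """
    # Q-values penalized by the norm of the value estimate (value regularization).
    q = r + gamma * P @ v - kappa * np.linalg.norm(v)      # shape (S, A)
    # Entropy-regularized greedy step: softmax backup instead of a hard max.
    v_new = tau * np.log(np.exp(q / tau).sum(axis=1))       # shape (S,)
    return v_new

# Tiny random MDP to exercise the update.
rng = np.random.default_rng(0)
S, A = 4, 2
P = rng.dirichlet(np.ones(S), size=(S, A))  # transition kernel, rows sum to 1
r = rng.random((S, A))
v = np.zeros(S)
for _ in range(200):
    v = twice_regularized_update(v, P, r)
print("regularized values:", np.round(v, 3))
```

With small regularization weights the update behaves like a contraction in practice, so fixed-point iteration converges; the paper's contribution is to show which regularizers make such a scheme provably equivalent to solving a robust MDP.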
Submission Number: 70