A Safe Bayesian Learning Algorithm for Constrained MDPs with Bounded Constraint Violation
TL;DR: Safe Bayesian Learning for CMDPs
Abstract: Constrained Markov decision process (CMDP) models are increasingly important in many applications with multiple objectives. When the model is unknown and must be learned online, it is desirable to ensure that the constraint is met, or at least that the violation remains bounded over time. Recent literature has made progress on this very challenging problem, but existing approaches either rely on unsatisfactory assumptions, such as knowledge of a safe policy, or suffer high cumulative regret. We propose Safe-PSRL, a posterior sampling-based RL algorithm that does not require such assumptions and yet performs very well, both in terms of theoretical regret bounds and empirically. The algorithm efficiently trades off exploration and exploitation via posterior sampling, and provably incurs only bounded constraint violation through carefully crafted pessimism. We establish a sub-linear $\tilde{O}(H^{2.5}\sqrt{|S|^2|A|K})$ upper bound on the Bayesian objective regret, along with a bounded, i.e., $\tilde{O}(1)$, constraint-violation regret over $K$ episodes for an $|S|$-state, $|A|$-action, horizon-$H$ CMDP, improving over state-of-the-art algorithms for the same setting.
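To make the high-level recipe in the abstract concrete, the following is a minimal illustrative sketch (not the authors' Safe-PSRL pseudocode) of how a posterior sampling loop for a tabular CMDP with a pessimism-inflated cost estimate might look. All specifics here, including the sizes, the Dirichlet prior, the greedy backward-induction planner, the cost budget, and the shrinking pessimism bonus `beta`, are assumptions made for this example only.

```python
import numpy as np

S, A, H, K = 5, 3, 10, 100          # states, actions, horizon, episodes (arbitrary)
rng = np.random.default_rng(0)

# Dirichlet posterior over transition probabilities, one per (s, a) pair.
dirichlet_counts = np.ones((S, A, S))
reward_est = np.zeros((S, A))        # empirical reward estimates
cost_est = np.zeros((S, A))          # empirical constraint-cost estimates
visits = np.zeros((S, A))

def plan(P, r, c, beta):
    """Toy planner: backward induction on reward; an action is screened out
    when its pessimistic (inflated) cost estimate exceeds a fixed budget."""
    V = np.zeros((H + 1, S))
    policy = np.zeros((H, S), dtype=int)
    for h in reversed(range(H)):
        for s in range(S):
            q = r[s] + P[s] @ V[h + 1]            # Q-values over actions
            safe = (c[s] + beta) <= 1.0           # hypothetical per-step budget
            if not safe.any():
                safe[:] = True                    # fall back if nothing passes
            q = np.where(safe, q, -np.inf)
            policy[h, s] = int(np.argmax(q))
            V[h, s] = q[policy[h, s]]
    return policy

for k in range(K):
    # Posterior sampling: draw one transition model from the Dirichlet posterior.
    P_sample = np.array([[rng.dirichlet(dirichlet_counts[s, a])
                          for a in range(A)] for s in range(S)])
    beta = 1.0 / np.sqrt(k + 1)                   # illustrative shrinking pessimism bonus
    policy = plan(P_sample, reward_est, cost_est, beta)

    # Roll out the policy; the "true" dynamics are randomized here purely so
    # the example runs end to end.
    s = 0
    for h in range(H):
        a = policy[h, s]
        s_next = rng.integers(S)
        r_obs, c_obs = rng.random(), rng.random()
        visits[s, a] += 1
        n = visits[s, a]
        reward_est[s, a] += (r_obs - reward_est[s, a]) / n
        cost_est[s, a] += (c_obs - cost_est[s, a]) / n
        dirichlet_counts[s, a, s_next] += 1
        s = s_next
```

The intent of the sketch is only to show where the two ingredients named in the abstract enter: exploration comes from planning against a model sampled from the posterior, while the pessimism term inflates the estimated constraint cost before an action is deemed safe.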
Submission Number: 954