Sample Complexity for Obtaining Sub-optimality and Violation Bounds for Distributionally Robust Constrained MDPs
Confirmation: Yes
Keywords: Constrained MDP, Robust MDP, Sample Complexity Bound
TL;DR: This is the first sample complexity result for robust constrained MDPs.
Abstract: We consider the problem of learning a safe policy that maximizes the cumulative reward while satisfying a constraint, even when there is a mismatch between the training and testing environments. In particular, we consider the *robust* constrained Markov decision process (CMDP) problem, where an agent must maximize the reward and satisfy the constraint against the worst possible stochastic model within an *unknown* uncertainty set. This problem poses significant additional challenges compared to both the non-robust CMDP problem and the unconstrained robust MDP problem. We seek to characterize the number of samples required to bound both the sub-optimality gap and the constraint violation by at most $\epsilon$. We observe that the primal-dual approaches that achieve sample complexity bounds for non-robust CMDPs cannot do the same in the robust CMDP case, since strong duality does not hold even when Slater's condition is satisfied. Nevertheless, we propose a robust safe value learning algorithm based on an approach in which a *rectified* penalty for the constraint violation is added to the objective. We assume that the algorithm has access to a generative model of the *nominal* (training) environment around which the uncertainty set is defined. We show that, for uncertainty sets specified by various popular distance metrics, our proposed algorithm achieves a policy with an $\epsilon$ sub-optimality gap and an $\epsilon$ violation bound after $\tilde{\mathcal{O}}(H^5|S||A|/\epsilon^2)$ samples, where $|S|$ is the cardinality of the state space, $|A|$ is the cardinality of the action space, and $H$ is the length of the episode. *This is the first result that achieves a sample complexity bound for robust CMDP problems*.
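To make the rectified-penalty idea in the abstract concrete, the following is a minimal illustrative sketch, not the paper's exact formulation; all symbols ($V^{\pi}_{r,P}$, $V^{\pi}_{c,P}$, $\mathcal{P}$, $b$, $\lambda$) are our own notation, introduced here for exposition. A rectified-penalty objective for the robust CMDP could take the form

$$\max_{\pi}\;\Big(\min_{P\in\mathcal{P}} V^{\pi}_{r,P}\Big)\;-\;\lambda\,\Big[\,b-\min_{P\in\mathcal{P}} V^{\pi}_{c,P}\,\Big]_{+},$$

where $V^{\pi}_{r,P}$ and $V^{\pi}_{c,P}$ denote the reward and constraint value functions of policy $\pi$ under transition model $P$, $\mathcal{P}$ is the uncertainty set around the nominal model, $b$ is the constraint threshold, $\lambda>0$ is a penalty coefficient, and $[x]_{+}=\max(x,0)$. The rectification means the penalty is incurred only when the worst-case constraint value falls below $b$; unlike a Lagrangian term, it never rewards over-satisfying the constraint, which is consistent with sidestepping the failure of strong duality noted in the abstract.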
Submission Number: 12