Sample Complexity of Model-Based Robust Reinforcement Learning

Kishan Panaganti, Dileep Kalathil

Published: 2021, Last Modified: 12 May 2023, CDC 2021, Readers: Everyone

Abstract: We consider the problem of learning the optimal robust value function and the optimal robust policy in a discounted-reward Robust Markov Decision Process (RMDP). The goal of the RMDP framework is to find a policy that is robust to parameter uncertainties arising from the mismatch between the simulator model and real-world settings. While the optimal robust value function and policy can be computed using robust dynamic programming, doing so requires exact knowledge of the nominal simulator model and the uncertainty set around it. This paper proposes a model-based robust reinforcement learning algorithm that learns an ϵ-optimal robust value function and policy in a finite state and action space setting when the nominal simulator model is not known exactly. We assume access to a standard generative sampling model, which can generate next-state samples for all state-action pairs of the nominal simulator model. We give a precise characterization of the sample complexity of obtaining an ϵ-optimal robust value function and policy using our algorithm. Finally, we demonstrate the performance of our algorithm on some benchmark problems.
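
The two-stage structure described in the abstract (estimate the nominal model from generative samples, then run robust dynamic programming on the estimate) can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's algorithm: the total-variation uncertainty set, the `radius` parameter, the fixed iteration count, and all function names are hypothetical choices made here for concreteness.

```python
import numpy as np


def estimate_nominal_model(sample_next_state, n_states, n_actions, n_samples):
    """Empirical transition model built from a generative sampler.

    `sample_next_state(s, a)` is an assumed callable returning one
    next-state index drawn from the nominal simulator.
    """
    p_hat = np.zeros((n_states, n_actions, n_states))
    for s in range(n_states):
        for a in range(n_actions):
            for _ in range(n_samples):
                p_hat[s, a, sample_next_state(s, a)] += 1.0
    return p_hat / n_samples


def worst_case_expectation_tv(p, v, radius):
    """Minimum of q @ v over distributions q with 0.5 * ||q - p||_1 <= radius.

    Assumed uncertainty set: a total-variation ball around p. The minimum
    is reached by moving up to `radius` probability mass from the
    highest-value states to the lowest-value state.
    """
    q = np.asarray(p, dtype=float).copy()
    lowest = int(np.argmin(v))
    budget = float(radius)
    for s in np.argsort(v)[::-1]:  # highest-value states first
        if budget <= 0.0:
            break
        if s == lowest:
            continue
        moved = min(q[s], budget)
        q[s] -= moved
        q[lowest] += moved
        budget -= moved
    return float(q @ v)


def robust_value_iteration(p_hat, reward, gamma, radius, n_iters=200):
    """Robust dynamic programming on the estimated model (illustrative)."""
    n_states, n_actions, _ = p_hat.shape
    v = np.zeros(n_states)
    q_values = np.zeros((n_states, n_actions))
    for _ in range(n_iters):
        for s in range(n_states):
            for a in range(n_actions):
                worst = worst_case_expectation_tv(p_hat[s, a], v, radius)
                q_values[s, a] = reward[s, a] + gamma * worst
        v = q_values.max(axis=1)
    # Robust value estimate and a policy that is greedy with respect to it.
    return v, q_values.argmax(axis=1)
```

Setting `radius = 0` reduces the sketch to ordinary value iteration on the estimated model, which gives a simple sanity check: the robust values should never exceed the non-robust ones.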
