
In this paper, we studied the problem of regret minimization in robust MDPs with rectangular uncertainty sets. We proposed a robust variant of optimistic policy optimization, which achieves sublinear regret for all uncertainty sets considered. Our algorithm balances the exploration-exploitation trade-off through a carefully designed bonus term, which quantifies not only the uncertainty due to limited observations but also the uncertainty inherent in robust MDPs. Our results are the first regret upper bounds, and the first non-asymptotic results, for robust MDPs without access to a generative model.

For future work, while our analysis achieves the same bound as the policy optimization algorithm in~\cite{shani2020optimistic} when the robustness level $\rho=0$, we suspect some technical details could be improved. For instance, we required $P_h^o$ to be positive for all $s,a$ in order to form a solvable Fenchel dual. However, this positive value cancels later and does not appear in the bound, which suggests that the strict positivity assumption may be an artifact of the analysis. Additionally, one could explore other characterizations of the uncertainty set, such as those based on the Wasserstein distance. One could also extend our results to a broader class of robust MDPs, such as those with infinitely many states and function approximation.

%There are several possible future extensions of this work.  The other direction is to extend to robust MDPs with possibly infinite states and actions, in which case function approximations would be needed.   