\paragraph{RL with robust MDP} Robust MDPs allow the transition kernel to take values from an uncertainty set. The objective in a robust MDP is to learn an optimal robust policy, that is, one that maximizes the worst-case value function over this set. When the MDP is fully known, this problem can be solved through dynamic programming methods \cite{iyengar2005robust,nilim2005robust,mannor2012lightning}.
If one has access to a generative model, several model-based reinforcement learning methods have been proven to be statistically efficient \cite{panaganti2022sample,yang2021towards}. Similar results can also be achieved when an offline dataset is available, for which previous works \cite{qi2020robust,zhou2021finite,kallus2022doubly,ma22distribution} show an $O(1/\epsilon^2)$ sample complexity for learning an $\epsilon$-optimal policy. In addition, \cite{liu2022distributionally} proposed distributionally robust policy Q-learning, which solves for the asymptotically optimal Q-function.
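
For concreteness, the robust objective described above can be sketched as a max-min problem. The symbols here are assumptions introduced only for illustration and are not fixed by this section: $\mathcal{U}$ denotes the uncertainty set of transition kernels, $V^{\pi}_{P}(s_1)$ the value of policy $\pi$ under kernel $P$ from an initial state $s_1$, and $\pi^{\star}$ an optimal robust policy.
\[
  \pi^{\star} \in \arg\max_{\pi} \; \min_{P \in \mathcal{U}} \; V^{\pi}_{P}(s_1).
\]
Under this sketch, a policy $\hat{\pi}$ would be called $\epsilon$-optimal if $\min_{P \in \mathcal{U}} V^{\pi^{\star}}_{P}(s_1) - \min_{P \in \mathcal{U}} V^{\hat{\pi}}_{P}(s_1) \le \epsilon$.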

In the online setting, the only results available are asymptotic. For discounted MDPs, \cite{wang2021online,badrinath2021robust} study the policy gradient method and show an $O(\epsilon^{-3})$ convergence rate for an alternative learning objective (a smoothed variant), which could be equivalent to the original policy gradient objective in an asymptotic regime.
%To our best knowledge, most of the literature on robust RL focuses on asymptotic results or sample complexity results. 
In general, such sample complexity and asymptotic results do not imply sublinear regret in robust MDPs \cite{dann2017unifying}. We summarize known results in the online setting in Table~\ref{table:compare}. We note that value estimation~\cite{panaganti2022sample,yang2021towards} does not directly lead to an optimal policy, so we convert the rates by applying an additional value iteration step.

\paragraph{RL with adversarial MDP}
We distinguish our problem setup from another framework, often referred to as the adversarial MDP, where the MDP parameters can be adversarially chosen while the agent interacts with the environment.
This problem is more challenging than the robust MDP, because the robust MDP assumes that the agent interacts with a fixed environment and is tested on adversarial tasks. In general, the adversarial MDP problem is proven to be NP-hard \cite{even2004experts}.
Several works study the variant where the adversary can only modify the reward function, while the transition dynamics of the MDP remain unchanged.
In this case, it is possible to obtain policy-based algorithms that are efficient with a sublinear regret \cite{rosenberg2019online,jin2020simultaneously,pmlr-v119-jin20c,shani2020optimistic,cai2020provably}.
Alternatively, researchers investigate the setting where the transition is only allowed to be adversarially chosen for $C$ out of the $K$ total episodes. Regret bounds of $O(C^2 + \sqrt{K})$ have been established for this setting \cite{lykouris2021corruption,chen2021improved,zhang2022corruption}.

%When the rewards are assumed to be adversarial and the transitions are still determined, policy-based algorithms are shown to be efficient with a sublinear regret. 
%In the case where the transition is adversarially chosen without any restrictions, it is NP-hard to obtain low regret. 
 

\paragraph{Non-robust policy optimization}
Policy optimization has been extensively investigated under non-robust MDPs \cite{neu2010online,cai2020provably,shani2020optimistic,wu2022nearly,chen2021minimax}. The proposed methods are proven to achieve sublinear regret.
The methods are also closely related to empirically successful policy optimization algorithms in RL, such as PPO \cite{schulman2017proximal} and TRPO \cite{schulman2015trust}. %Several works extend these policy-based methods to consider robustness in MDPs or even adversarial MDPs \cite{jin2019learning,shani2020optimistic,he2021nearly}. However, such robustness or adversarial concerns are only addressed towards the reward function, whereas the formulation of robust MDPs considers the robustness of the dynamic of the MDPs.