The rapid progress of reinforcement learning (RL) algorithms enables trained agents to navigate complicated environments and solve complex tasks. Standard reinforcement learning methods, however, may fail catastrophically in another environment, even if the two environments differ only slightly in dynamics \cite{farebrother2018generalization,packer2018assessing,cobbe2019quantifying,song2019observational,raileanu2021decoupling}. In practical applications, such mismatches of environment dynamics are common and can arise for a number of reasons, e.g., model deviation due to incomplete data, unexpected perturbations, and possible adversarial attacks. Part of the sensitivity of standard RL algorithms stems from the formulation of the underlying Markov decision process (MDP). Over a sequence of interactions, the MDP assumes the dynamics to be unchanged, and the trained agent to be tested on the same dynamics thereafter.
%While MDP provides a practical framework for solving sequential decision problems, it neglects possible uncertainty of the environment and can easily lead to a non-robust policy. 

To model the potential mismatch between system dynamics, the framework of robust MDPs was introduced to account for uncertainty in the parameters of the MDP \cite{satia1973markovian,white1994markov,nilim2005robust,iyengar2005robust}. Under this framework, the dynamics of an MDP are no longer fixed but can come from an uncertainty set, such as a rectangular uncertainty set (e.g., in \cite{iyengar2005robust, nilim2005robust}), centered around a nominal transition kernel. The agent
can interact with the nominal transition kernel to learn a policy, which is then evaluated on the worst possible transition kernel from the uncertainty set. Therefore, instead of searching for a policy that may only perform well on the nominal transition kernel, the objective is to find the policy that performs best in the worst case.
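
For concreteness, this objective can be written as a max-min problem. The notation below is introduced only for illustration and is an assumption, not necessarily the notation of the formal sections: $P^o$ denotes the nominal transition kernel, $\mathcal{P}(P^o)$ an uncertainty set centered at $P^o$, and $V^{\pi,P}$ the value of policy $\pi$ under transition kernel $P$.
\[
\pi^\star \in \arg\max_{\pi} \; \min_{P \in \mathcal{P}(P^o)} V^{\pi, P}.
\]
In its usual form, $(s,a)$-rectangularity means that the uncertainty set factorizes across state-action pairs, $\mathcal{P}(P^o) = \bigotimes_{(s,a)} \mathcal{P}_{s,a}$, so that the worst-case transition can be chosen independently for each $(s,a)$; under $s$-rectangularity, the set factorizes only across states.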

A robust MDP can be viewed as a dynamic zero-sum game, where the RL agent tries to choose the best policy while nature imposes the worst possible dynamics. When the environment transition and reward are known, solving the robust MDP problem is tractable under suitable assumptions, as shown in the aforementioned works. When information about the environment is missing but a generative model (also known as a simulator) of the environment or a suitable offline dataset is available, one can obtain an $\epsilon$-optimal robust policy with $\Tilde{O}(\epsilon^{-2})$ samples \cite{qi2020robust,panaganti2022sample,wang2022policy,ma22distribution}.


%In the case where the uncertainty sets contain all possible transitions but the nominal transition model has no non-zero entries, our algorithm is still robust with sublinear regret.  

% Our contributions can be summarized as follows. 
% \begin{enumerate}
%     \item We propose the first algorithm for online robust MDPs with a rectangular uncertainty set. Our algorithm does not know the nominal model or the uncertainty set, and requires no access to a generative model or an offline dataset.
%     \item We establish the first sublinear regret upper bound for online robust MDPs under $(s,a)$ and $s$-rectangular uncertainty set. 
%     Table \ref{table:compare} shows detailed comparisons of the results and the settings considered. 
%     %\item To validate our theoretical findings, we conduct experiments on the GridWorld environment. 
% \end{enumerate}

\begin{table*}[htb]
\caption{Comparison of previous results with ours, where $S$ and $A$ are the sizes of the state space and action space, $H$ is the length of the horizon, $K$ is the number of episodes, $\rho$ is the radius of the uncertainty set, and $\epsilon$ is the level of suboptimality. We use the shorthand $\iota = \log(SAH^2 K^{3/2} (1 + \rho))$. }\centering \label{table:compare}
\begin{tabular}{@{}ccccc@{}}
\toprule
                                            & Algorithm                                               & Rectangular  & Regret                                                                            & \begin{tabular}[c]{@{}c@{}}Sample  Complexity\end{tabular} \\ \midrule
\cite{wang2021online}      & \begin{tabular}[c]{@{}c@{}}Value  based\end{tabular}  & $(s,a)$      & NA                                                                                & Asymptotic                                                   \\
\cite{badrinath2021robust} & \begin{tabular}[c]{@{}c@{}}Policy  based\end{tabular} & $(s,a)$      & NA                                                                                & Asymptotic                                                   \\
\hline
\multirow{2}{*}{\textbf{Ours}}              & \begin{tabular}[c]{@{}c@{}}Policy  based\end{tabular} & $(s,a)$      & \begin{tabular}[c]{@{}c@{}}$O \left( SH^2  \sqrt{AK\iota}\right)$\end{tabular} & $O\left( \frac{H^4 S^2 A \iota}{\epsilon^2}\right)$          \\
                                            & \begin{tabular}[c]{@{}c@{}}Policy  based\end{tabular} & $s$          & $O \left( SA^2 H^2\sqrt{K\iota}\right) $                                          & $O \left( \frac{H^4S^2 A^4  \iota}{\epsilon^2}\right)$       \\ \bottomrule
\end{tabular}
\end{table*}
% \begin{table*}[ht!]
% \centering\caption{Comparisons of previous results and our results, where $S,A$ are the size of the state space and action space, $H$ is the length of the horizon, $K$ is the number of episodes, $\rho$ is the radius of the uncertainty set and $\epsilon$ is the level of suboptimality. We shorthand $\iota = \log(SAH^2 K^{3/2} (1 + \rho))$. The regret upper bound by \cite{panaganti2022sample} is obtained by converting their sample complexity results and the sample complexity result for our work is converted through our regret bound. We use ``GM'' to denote the requirement of a generative model. {\color{black}The superscript $^*$ stands for results obtained via batch-to-online conversion.}  The reference to the previous works are [A]: \cite{panaganti2022sample}, [B]: \cite{wang2021online}, [C]: \cite{badrinath2021robust}, [D]: \cite{yang2021towards}.}\label{table:compare}
% \begin{tabular}{|c|c|c|c|c|c|}
% \hline
%                                & Algorithm                                                                   & Requires  & Rectangular        & Regret                                                                                                                                            & Sample Complexity                                                                                                                  \\ \hline
% {\cite{panaganti2022sample}}                        & \begin{tabular}[c]{@{}c@{}}Value\\ based\end{tabular}                   & GM            & $(s,a)$      & \begin{tabular}[c]{@{}c@{}}{\color{black}$O \left( 
% K^{\frac{2}{3}} H^{\frac{5}{3}} S^{\frac{2}{3}} A^{\frac{1}{3}}\right)^*$} \\ \end{tabular}                                                 & \begin{tabular}[c]{@{}c@{}}$O\left(\frac{H^4S^2A}{\epsilon^2} \right)$\\ \end{tabular}                           \\ \hline


% {[}B{]}                        & \begin{tabular}[c]{@{}c@{}}Value\\ based\end{tabular}                   & -            & $(s,a)$      & \begin{tabular}[c]{@{}c@{}}NA\\ \end{tabular}                                                 & Asymptotic                \\ \hline


% {[}C{]}                        & \begin{tabular}[c]{@{}c@{}}Policy\\ based\end{tabular}                   & -            & $(s,a)$      & \begin{tabular}[c]{@{}c@{}}NA\\ \end{tabular}                                                 & Asymptotic                \\ \hline
% \multirow{2}{*}{{[}D{]}}       & \multirow{2}{*}{\begin{tabular}[c]{@{}c@{}}Value\\ based\end{tabular}}  & \multirow{2}{*}{GM} & $(s,a)$                                                                                  & NA                                                             & \begin{tabular}[c]{@{}c@{}}{\color{black}$\Tilde{O}\left(\frac{H^4S^2A(2 + \rho)^2}{\rho^2\epsilon^2} \right)$} \end{tabular} \\ \cline{4-6} &                                                                         &                     & $s$& NA                                             & \begin{tabular}[c]{@{}c@{}}{\color{black}$\Tilde{O}\left(\frac{H^4S^2A^2(2 + \rho)^2}{\rho^2\epsilon^2} \right)$} \end{tabular}   
%                                \\ \hline

% \multirow{2}{*}{\textbf{Ours}} & \multirow{2}{*}{\begin{tabular}[c]{@{}c@{}}Policy\\ based\end{tabular}} & \multirow{2}{*}{-}  & $(s,a)$ &  $O \left( 
% SH^2  \sqrt{AK\iota}\right)$ & $O\left( \frac{H^4 S^2 A \iota}{\epsilon^2}\right)$                                                  

% \\ \cline{4-6} 
%                                &                                                                         & & $s$ & $O \left( SA^2 H^2\sqrt{K\iota}\right) $                      & $O \left( \frac{H^4S^2 A^4  \iota}{\epsilon^2}\right)$                                                                                                                            \\ \hline
% \end{tabular}
% \end{table*} 
%\dong{need to make a footnote of yang's result.}



However, a generative model is rarely available in real applications. Therefore, in this work, we consider an \emph{online} setting: the agent sequentially interacts with the environment and tackles the exploration-exploitation challenge as it balances exploring the state space against exploiting high-reward actions. A practical motivation for the online robust MDP formulation is as follows: a policy learned through reinforcement learning can only be obtained by interacting with a simulator, but ultimately it is asked to minimize regret in the real environment. The simulator, however, is inherently inaccurate, since it is only ever an approximation of the real world. The disparity between the simulated and the real environment can be modeled through the online robust MDP setting.

The online setup is well understood in standard MDP problems \cite{jin2019learning,rosenberg2019online,jin2020simultaneously}. 
%in the online setting, which is captured by the regret, is more challenging to achieve than algorithm convergence. 
Yet, in the robust MDP setting, previous sample complexity results do not, in general, directly imply sublinear regret \cite{dann2017unifying}. A natural question then arises:
% \vspace{-0.3cm}
\begin{center}

\textit{Can we design a robust RL algorithm that attains sublinear regret in robust MDPs with a rectangular uncertainty set?}
\end{center}
We propose the first policy optimization algorithm for robust MDPs under a rectangular uncertainty set. One of the challenges in deriving a regret guarantee for robust MDPs stems from their adversarial nature. As the transition dynamics can be picked adversarially from a predefined set, the optimal policy may be randomized \cite{wiesemann2013robust}. This is in contrast with conventional MDPs, where there always exists a deterministic optimal policy, which can be found with value-based methods and a greedy policy (e.g., UCB-VI algorithms). With this observation in mind, we resort to policy optimization (PO)-based methods, which directly optimize a stochastic policy in an incremental way.

With a stochastic policy, our algorithm explores robust MDPs in an optimistic manner. To achieve this robustly, we propose a carefully designed bonus function via the dual conjugate of the robust Bellman equation. This bonus quantifies both the uncertainty stemming from the limited historical data and the uncertainty of the MDP dynamics. In the episodic setting of robust MDPs, we show that our algorithm attains sublinear regret $O(\sqrt{K})$ for both $(s,a)$-rectangular and $s$-rectangular uncertainty sets, where $K$ is the number of episodes. In the case where the uncertainty set contains only the nominal transition model, our results recover the previous regret upper bound of non-robust policy optimization \cite{shani2020optimistic}. Our result is the first provably efficient regret bound for the online robust MDP problem, as shown in Table~\ref{table:compare}. We further validate our algorithm with experiments.
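
As a rough sanity check, the two result columns of Table~\ref{table:compare} are consistent with a standard regret-to-sample-complexity conversion. The following is only a sketch under stated assumptions: it treats $\iota$ as a fixed logarithmic factor (although $\iota$ itself depends on $K$), suppresses constants, and uses $\mathrm{Regret}(K)$ as illustrative notation for the cumulative regret over $K$ episodes. Requiring the average regret to be at most $\epsilon$ in the $(s,a)$-rectangular case gives
\[
\frac{\mathrm{Regret}(K)}{K} = O\left( \frac{SH^2\sqrt{AK\iota}}{K} \right) = O\left( SH^2\sqrt{\frac{A\iota}{K}} \right) \le \epsilon
\quad\Longleftarrow\quad
K = O\left( \frac{H^4 S^2 A \iota}{\epsilon^2} \right),
\]
and the same argument in the $s$-rectangular case gives
\[
\frac{\mathrm{Regret}(K)}{K} = O\left( SA^2H^2\sqrt{\frac{\iota}{K}} \right) \le \epsilon
\quad\Longleftarrow\quad
K = O\left( \frac{H^4 S^2 A^4 \iota}{\epsilon^2} \right).
\]
Both expressions match the sample complexity entries reported in the table.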


