\documentclass{article}


% if you need to pass options to natbib, use, e.g.:
\PassOptionsToPackage{numbers, compress}{natbib}
% before loading neurips_2022


% ready for submission
%\usepackage{neurips_robustseq_2022}


% to compile a preprint version, e.g., for submission to arXiv, add the
% [preprint] option:
%     \usepackage[preprint]{neurips_robustseq_2022}


% to compile a camera-ready version, add the [final] option, e.g.:
\usepackage[final]{neurips_robustseq_2022}


% to avoid loading the natbib package, add option nonatbib:
%    \usepackage[nonatbib]{neurips_robustseq_2022}


\usepackage[utf8]{inputenc} % allow utf-8 input
\usepackage[T1]{fontenc}    % use 8-bit T1 fonts
\usepackage{hyperref}       % hyperlinks
\usepackage{url}            % simple URL typesetting
\usepackage{booktabs}       % professional-quality tables
\usepackage{amsfonts}       % blackboard math symbols
\usepackage{nicefrac}       % compact symbols for 1/2, etc.
\usepackage{microtype}      % microtypography
\usepackage{xcolor}         % colors

\input{ICLR/math_commands}
\usepackage[small,bf]{caption}
%\usepackage[table,xcdraw]{xcolor}
\usepackage{natbib}
\bibliographystyle{plainnat}
\usepackage{amsthm,apxproof}
\usepackage{thmtools,amsmath,amsfonts} \usepackage{amssymb}% http://ctan.org/pkg/amssymb
\usepackage{pifont}% http://ctan.org/pkg/pifont
\newcommand{\cmark}{\ding{51}}%
\newcommand{\xmark}{\ding{55}}%
\usepackage{thm-restate}
\newcommand{\dong}[1]{\textcolor{blue}{\textbf{[Dong: #1]}}}
\usepackage{algorithmic,algorithm}
\usepackage{caption}
\usepackage{subcaption}
\usepackage{wrapfig}
\usepackage{mathtools}
\usepackage{multirow}
\usepackage{hyperref}
\usepackage{url}
\newtheorem{thm}{Theorem}
\newtheorem{lem}{Lemma}
\newtheorem{cor}{Corollary}[section]
\newtheorem{prop}{Proposition}[section]
\newtheorem{asmp}{Assumption}[section]
\newtheorem{defn}{Definition}[section]
\newtheorem{oracle}{Oracle}[section]
\newtheorem{claim}{Claim}[section]
\newtheorem{conj}{Conjecture}[section]
\newtheorem{rem}{Remark}[section]
\newtheorem{example}{Example}[section]
\newtheorem{condition}{Condition}[section]
\allowdisplaybreaks

\title{Online Policy Optimization for Robust MDP}


% The \author macro works with any number of authors. There are two commands
% used to separate the names and addresses of multiple authors: \And and \AND.
%
% Using \And between authors leaves it to LaTeX to determine where to break the
% lines. Using \AND forces a line break at that point. So, if LaTeX puts 3 of 4
% authors names on the first line, and the last on the second line, try using
% \AND instead of \And before the third author name.
\newcommand*\samethanks[1][\value{footnote}]{\footnotemark[#1]}
\author{%
  Jing Dong \thanks{Author names are listed in alphabetical order.}\\
  The Chinese University of Hong Kong, Shenzhen \\
  \texttt{jingdong@link.cuhk.edu.cn} \\
  % examples of more authors
   \And
   Jingwei Li \samethanks[1]\\
   Tsinghua University \\
   \texttt{ljw22@mails.tsinghua.edu.cn} \\
   \AND
   Baoxiang Wang  \samethanks[1]\\
    The Chinese University of Hong Kong, Shenzhen \\
   \texttt{bxiangwang@cuhk.edu.cn} \\
   \And
   Jingzhao Zhang  \samethanks[1] \  \thanks{Jingzhao Zhang is also affiliated with Shanghai Qi Zhi Institute and Shanghai Artificial Intelligence Laboratory.}\\
   Tsinghua University \\
   \texttt{jingzhaoz@mail.tsinghua.edu.cn} \\
  % \And
  % Coauthor \\
  % Affiliation \\
  % Address \\
  % \texttt{email} \\
}


\begin{document}


\maketitle


\begin{abstract}
Reinforcement learning (RL) has exceeded human performance in many synthetic settings such as video games and Go. However, real-world deployment of end-to-end RL models is less common, as RL models can be very sensitive to slight perturbations of the environment. 
The robust Markov decision process (MDP) framework---in which the transition probabilities belong to an uncertainty set around a nominal model---provides one way to develop robust models. 
While previous analysis shows RL algorithms are effective assuming access to a generative model, it remains unclear whether RL can be efficient under a more realistic online setting, which requires a careful balance between exploration and exploitation. 
In this work, we consider online robust MDPs, where the agent learns by interacting with an unknown nominal system. We propose a robust optimistic policy optimization algorithm that is provably efficient. 
To address the additional uncertainty caused by an adversarial environment, our model features a new optimistic update rule derived via Fenchel conjugates. 
Our analysis establishes the first regret bound for online robust MDPs. 
\end{abstract}

\section{Introduction}
The rapid progress of reinforcement learning (RL) algorithms enables trained agents to navigate complicated environments and solve complex tasks. Standard reinforcement learning methods, however, may fail catastrophically in another environment, even if the two environments differ only slightly in their dynamics \citep{farebrother2018generalization,packer2018assessing,cobbe2019quantifying,song2019observational,raileanu2021decoupling}. In practical applications, such mismatches in environment dynamics are common and can arise for a number of reasons, e.g., model deviation due to incomplete data, unexpected perturbations, and possible adversarial attacks. To model the potential mismatch between system dynamics, the framework of robust MDPs was introduced to account for the uncertainty in the parameters of the MDP \citep{satia1973markovian,white1994markov,nilim2005robust,iyengar2005robust}. Under this framework, the dynamics of an MDP are no longer fixed but can come from some uncertainty set, such as the rectangular uncertainty set, centered around a nominal transition kernel. The agent sequentially interacts with the nominal transition kernel to learn a policy, which is then evaluated on the worst possible transition from the uncertainty set. Therefore, the objective is to find the policy that performs best in the worst case.

If a generative model (also known as a simulator) of the environment or a suitable offline dataset is available, one can obtain an $\epsilon$-optimal robust policy with $\tilde{O}(\epsilon^{-2})$ samples under a rectangular uncertainty set \citep{qi2020robust,panaganti2022sample,wang2022policy,ma22distribution}. Yet access to a generative model is a stringent requirement in real applications. In a more practical online setting, the agent sequentially interacts with the environment and tackles the exploration-exploitation challenge as it balances between exploring the state space and exploiting the high-reward actions. 
Good performance in the online setting, which is captured by the regret, is more challenging to achieve than algorithm convergence. 
In the robust MDP setting, previous sample complexity results do not directly imply sublinear regret in general \citep{dann2017unifying}, and so far no asymptotic result is available. A more detailed review of related work is deferred to the Appendix. 

In this paper, we propose the first policy optimization algorithm for robust MDPs under a rectangular uncertainty set. One of the challenges in deriving a regret guarantee for robust MDPs stems from their adversarial nature. As the transition dynamics can be picked adversarially from a predefined set, the optimal policy is in general randomized \citep{wiesemann2013robust}. This is in contrast with conventional MDPs, where there always exists a deterministic optimal policy, which can be found with value-based methods and a greedy policy (e.g., UCB-VI algorithms). With this observation in mind, we resort to policy optimization (PO)-based methods, which directly optimize a stochastic policy in an incremental way.


With a stochastic policy, our algorithm explores robust MDPs in an optimistic manner. To achieve this robustly, we propose a carefully designed bonus function via the dual conjugate of the robust Bellman equation. This quantifies both the uncertainty stemming from the limited historical data and the uncertainty of the MDP dynamics. In the episodic setting of robust MDPs, we show that our algorithm attains sublinear regret $O(\sqrt{K})$ for both $(s,a)$- and $s$-rectangular uncertainty sets, where $K$ is the number of episodes. In the case where the uncertainty set contains only the nominal transition model, our results recover the previous regret upper bound of non-robust policy optimization \citep{shani2020optimistic}. Our result is the first provably efficient regret bound for the online robust MDP problem. We further validate our algorithm with experiments.

\section{Problem formulation}
In this section, we describe the formal setup of robust MDPs. We start by defining some notation.
\paragraph{Robust Markov decision process}
We consider an episodic finite-horizon robust MDP, which can be denoted by a tuple $\gM = \langle \gS, \gA, H, $ $\{\gP\}_{h=1}^H, \{r\}_{h=1}^H \rangle$. Here $\gS$ is the state space, $\gA$ is the action space, $\{r\}_{h=1}^H$ is the time-dependent reward function, and $H$ is the length of each episode. Instead of a fixed set of time-dependent transition kernels, the transitions of the robust MDP are governed by kernels that lie within a time-dependent uncertainty set $\{\gP\}_{h=1}^H$, $\ie$, the time-dependent transition satisfies $P_h \in \gP_h \subseteq \Delta_{\gS}$ at time $h$. We consider the case where the rewards are stochastic. That is, at state-action pair $(s,a)$ at time $h$, the immediate reward is $R_h(s,a) \in [0,1]$, which is drawn i.i.d.\ from a distribution with expectation $r_h(s,a)$.
With the described setup of robust MDPs, we now define the policy and its associated value.

\paragraph{Policy and robust value function}
A time-dependent policy $\pi$ is defined as $\pi = \{\pi_h\}_{h=1}^H$, where each $\pi_h$ is a function from $\gS$ to the probability simplex over actions, $\Delta(\gA)$. 
If the transition kernel is fixed to be $P$, the performance of a policy $\pi$ starting from state $s$ at time $h$ can be measured by its value function, which is defined as 
$
    V_h^{\pi, P}(s) = \mathbb{E}_{\pi, P}\left[ \sum^H_{h^\prime = h} r_{h^\prime}(s_{h^\prime}, a_{h^\prime}) \mid s_h = s\right] 
$.
In a robust MDP, the robust value function instead measures the performance of $\pi$ under the worst possible choice of transition $P$ within the uncertainty set. Specifically, the value and the Q-value function of a policy given the state-action pair $(s,a)$ at step $h$ are defined as 
\begin{align*}% \label{eq:robust_val}
    V^{\pi}_h (s) = \ & \min_{\{P_h\} \in \{\gP_h\}} V^{\pi, \{P\}}_h (s) \,, \nonumber \\
    Q^{\pi}_h (s,a) = \ & \min_{\{P_h\} \in \{\gP_h\}} \mathbb{E}_{\pi, \{P\}}\left[\sum^H_{h^\prime =h} r_{h^\prime} (s_{h^\prime} , a_{h^\prime} ) \mid (s_h,a_h) = (s, a)\right] \,.
\end{align*}
The optimal value function is defined to be the best possible value attained by a policy
$
    V^{\ast}_h (s) = \max_{\pi} V^{\pi}_h (s) = \max_{\pi} \min_{\{P_h\} \in \{\gP_h\}} V^{\pi, \{P\}}_h (s) 
$.
The optimal policy is then defined to be the policy that attains the optimal value.

\paragraph{Robust Bellman equation}
Similar to the non-robust MDP, the robust MDP satisfies the following robust Bellman equation, which characterizes a recursive relation for the robust value function:
$
    Q^{\pi}_h (s,a) = r_h(s,a) + \sigma_{\gP_h}(V_{h+1}^\pi)(s,a)\,, \quad V^{\pi}_h (s) = \langle Q^{\pi}_h (s,\cdot), \pi_h(\cdot \mid s) \rangle
$, where
$%\label{eq:sigma}
    \sigma_{\gP_h}(V_{h+1}^\pi)(s,a) = \min_{P_h \in \gP_h} \limits P_h(\cdot \mid s,a) V_{h+1}^\pi \,, \quad P_h(\cdot \mid s,a) V = \sum_{s^\prime \in \gS} \limits P_h(s^\prime \mid s,a) V(s^\prime)\,.
$

Without additional assumptions on the uncertainty set, computing the optimal policy and value of the robust MDP is in general NP-hard \citep{wiesemann2013robust}. Thus, to limit the level of perturbations, we assume that the transition kernels are close to the nominal transition kernel, as measured by the $\ell_1$ distance. We consider two cases.

%The $(s,a)$-rectangular assumption assumes that the uncertain transition kernel within the set takes value independently for each $(s,a)$. We further use $\ell_1$ distance to characterize the $(s,a)$-rectangular set around a nominal kernel with a specified level of uncertainty.
\begin{defn}[$(s,a)$-rectangular uncertainty set \citep{iyengar2005robust,wiesemann2013robust}]\label{def:sa}
For every time step $h$ and a given state-action pair $(s,a)$, the $(s,a)$-rectangular uncertainty set $\gP_h(s,a)$ is defined as $
\gP_h(s,a) = \left\{ P_h(\cdot \mid s,a) \in \Delta(\gS) : \left\|P_h(\cdot \mid s,a) - P_h^o(\cdot \mid s,a) \right\|_1 \leq \rho \right\} 
$,
where $P_h^o$ is the nominal transition kernel at $h$, $P_h^o(\cdot \mid s,a) > 0, \forall (s,a) \in \gS \times \gA$, and $\rho$ is the level of uncertainty.
\end{defn}
%With the $(s,a)$-rectangular set, it is shown that there always exists an optimal policy that is deterministic \cite{wiesemann2013robust}. 

One way to relax the $(s,a)$-rectangular assumption is to instead let the uncertain transition kernels within the set take values independently for each $s$ only. This characterization is more general, and its solution gives a stronger robustness guarantee. 
\begin{defn}[$s$-rectangular uncertainty set \citep{wiesemann2013robust}]\label{def:s}
For every time step $h$ and a given state $s$, the $s$-rectangular uncertainty set $\gP_h(s)$ is defined as 
$
\gP_h(s) = \left\{ P_h(\cdot \mid s,\cdot) \in \Delta(\gS)^{\gA} : \sum_{a \in \gA}\left\|P_h(\cdot \mid s,a) - P_h^o(\cdot \mid s,a) \right\|_1 \leq A \rho \right\} 
$,
where $P_h^o$ is the nominal transition kernel at $h$, $P_h^o(\cdot \mid s,a) > 0, \forall (s,a) \in \gS \times \gA$, and $\rho$ is the level of uncertainty.
\end{defn}
Unlike the $(s,a)$-rectangular assumption, which guarantees the existence of a deterministic optimal policy, the optimal policy under the $s$-rectangular set may need to be randomized \citep{wiesemann2013robust}. We also remark that the requirement $P_h^o(\cdot \mid s,a) > 0$ is mostly for technical convenience. 
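
As a small illustration of the $(s,a)$-rectangular set in Definition \ref{def:sa}, consider a hypothetical instance whose numbers are chosen only for illustration. Suppose there are two next states $s_1, s_2$ with values $V(s_1) = 1$ and $V(s_2) = 0$, nominal probabilities $P_h^o(\cdot \mid s,a) = (1/2, 1/2)$, and $\rho = 0.2$. Moving probability mass $\delta$ from $s_1$ to $s_2$ changes the $\ell_1$ distance to the nominal kernel by $2\delta$, so the worst-case kernel moves $\delta = \rho/2 = 0.1$ and
\begin{align*}
    \sigma_{\gP_h}(V)(s,a) = \min_{0 \leq \delta \leq \rho/2} \left[ \left(\tfrac{1}{2} - \delta\right) \cdot 1 + \left(\tfrac{1}{2} + \delta\right) \cdot 0 \right] = 0.4 \,,
\end{align*}
compared with the nominal value $P_h^o(\cdot \mid s,a) V = 0.5$.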


Equipped with the characterization of the uncertainty set, we now describe the definition of regret under the robust MDP. 

\paragraph{Learning protocols and regret}
We consider a learning agent that repeatedly interacts with the environment defined by the nominal transition model in an episodic manner, over $K$ episodes. We remark that if the agent is asked to interact with a potentially adversarially chosen transition, the learning problem is NP-hard \citep{even2004experts}. We assume the agent always starts from a fixed initial state $s_0$. The performance of the learning agent is measured by the cumulative regret incurred over the $K$ episodes, which is defined to be the cumulative difference between the robust value of the optimal policy and the robust value of $\pi_k$. That is, 
$
\sum^K_{k=1} \left( V_1^{\ast}(s_0) - V_1^{\pi_k} (s_0) \right)
$,
where $s_0$ is the initial state and $\pi_k$ is the policy executed in episode $k$.


\section{Algorithm}
Our algorithm performs policy optimization with empirical estimates and encourages exploration by adding a bonus to less explored states. However, we need to propose a new efficiently computable bonus that is robust to adversarial transitions. We achieve this by solving a sub-optimization problem derived from the Fenchel conjugate. We present Robust Optimistic Policy Optimization (ROPO) and elaborate on its design components.

% \paragraph{The empirical model}
To start, as our algorithm has no access to the actual reward and transition function, we use the following empirical estimators of the reward and transition:
\begin{align}\label{eq:empirical}
    \hat{r}_h^k(s,a) =& \frac{\sum^{k - 1}_{k^\prime = 1} R_h^{k^\prime}(s,a)\mathbb{I} \left\{s_h^{k^\prime} = s, a_h^{k^\prime} = a\right\}}{N_h^k(s,a)} \,, \nonumber \\ \hat{P}_h^{o,k}(s^\prime \mid s,a) =& \frac{\sum^{k - 1}_{k^\prime = 1} \mathbb{I} \left\{s_h^{k^\prime} = s, a_h^{k^\prime} = a
    , s_{h+1}^{k^\prime} = s^\prime\right\}}{N_h^k(s,a)} \,,
\end{align}
where $N_h^k(s,a) = \max \left\{ \sum^{k-1}_{k^\prime = 1}  \mathbb{I}\left\{s_h^{k^\prime} = s, a_h^{k^\prime} = a\right\},1\right\}$.
%we use the empirical mean estimator (Equation \ref{eq:empirical}) $\hat{r}_h^k(s,a)$ and $\hat{P}_h^{o,k}(s,a)$ to estimate the reward and transition.

\paragraph{Robust Policy Evaluation step}
In each episode, the algorithm estimates $Q$-values with an optimistic variant of the Bellman equation. 
Specifically, to encourage exploration in the robust MDP, we add a bonus term $b_h^k(s,a)$ of order $b_h^k(s,a) = O\left(1 / \sqrt{N_h^k(s,a)} \right)$, which compensates for the lack of knowledge
of the actual reward and transition model as well as the uncertainty set.
\begin{align*}
    \hat{Q}^{k}_h (s,a) = \min\left\{\hat{r}(s,a) + \sigma_{\hat{\gP}_h}(\hat{V}_{h+1}^\pi)(s,a) + b_h^k(s,a), H\right\} \,.
\end{align*}
Intuitively, the bonus term $b_h^k$ is designed to provide the optimism required for efficient exploration, accounting for both the estimation error of $P$ and the robustness with respect to $P$. 
It is hard to control the two quantities in their primal form because of the coupling between them. We propose the following procedure to address the problem.

%While the above algorithm is effective, the computations involved may be costly. At each update step, one is required to construct an empirical estimate of the uncertainty set and then solve an inner minimization problem $\sigma_{\hat{\gP}_h}(\hat{V}_{h+1}^\pi)(s)$. The inner problem is a high-dimensional optimization problem. We now show that the inner minimization problem can be reduced to a lower dimensional optimization problem, which enjoys computation efficiency. 
Note that the key difference between our algorithm and standard policy optimization is that $\sigma_{\hat{\gP}_h}(\hat{V}_{h+1}^\pi)(s,a)$ requires solving an inner minimization. 
By relaxing the constraints with Lagrange multipliers and Fenchel conjugates, under the $(s,a)$-rectangular set, the inner minimization problem can be reduced to a one-dimensional unconstrained convex optimization problem on $\mathbb{R}$ (Lemma \ref{lem:sa_con}). 
\begin{align}\label{eq:inner_sa}
    \sup_{\eta} \eta - \frac{ (\eta - \min_s \limits \hVpn(s))_{+}}{2}\rho - \sum_{s^\prime}\hat{P}_h^o(s^\prime \mid s,a) \left( \eta - \hVpn(s^\prime)\right)_{+} \,.
\end{align}
 The optimum of Equation (\ref{eq:inner_sa}) can then be computed efficiently with bisection or sub-gradient methods. In addition to reducing computational complexity, the dual form decouples the uncertainty due to estimation error from the uncertainty due to robustness, as $\rho$ and $\hat{P}_h^o$ appear in different terms. The exact form of $b_h^k$ is presented in Equations (\ref{bonus_sa}) and (\ref{bonus_s}).
Similarly, in the case of the $s$-rectangular set, the inner minimization problem is equivalent to the following $A$-dimensional convex optimization problem, which can be solved efficiently in $\tilde{O}(A)$ iterations by methods such as gradient descent.
\begin{align}\label{eq:inner_s}
    \sup_{\eta} \ \sum_{a^\prime} \eta_{a^\prime} -  \sum_{s^\prime, a^\prime} \hat{P}_h^o(s^\prime \mid s,a^\prime) \left(\eta_{a^\prime} - \mathbb{I}\{a^\prime = a\} V_{h+1}^{\pi_k}(s^\prime) \right)_{+} - \min_{s^\prime, a^\prime}\frac{A \rho  (\eta_{a^\prime} - \mathbb{I}\{a^\prime = a\} V_{h+1}^{\pi_k}(s^\prime))_{+}}{2} \,.
\end{align}
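As a sanity check of Equation (\ref{eq:inner_sa}), consider a hypothetical two-state instance, chosen for illustration only, with next-state values $(1, 0)$, nominal probabilities $(1/2, 1/2)$, and $\rho = 0.2$. Writing $g(\eta)$ for the objective of Equation (\ref{eq:inner_sa}), we have
\begin{align*}
    g(\eta) =
    \begin{cases}
        \eta \,, & \eta \leq 0 \,, \\
        \eta - 0.1\eta - 0.5\eta = 0.4\eta \,, & 0 \leq \eta \leq 1 \,, \\
        \eta - 0.1\eta - 0.5(\eta - 1) - 0.5\eta = 0.5 - 0.1\eta \,, & \eta \geq 1 \,,
    \end{cases}
\end{align*}
so that $\sup_{\eta} g(\eta) = g(1) = 0.4$. This agrees with the primal worst case, which moves $\rho/2 = 0.1$ of probability mass from the state with value $1$ to the state with value $0$ and yields $0.4 \cdot 1 + 0.6 \cdot 0 = 0.4$.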

\paragraph{Policy Improvement Step}
Using the optimistic $Q$-value obtained from policy evaluation, the algorithm improves the policy with a KL-regularized online mirror descent step, 
\begin{align*}
    \pi_h^{k+1} \in \arg\max_{\pi } \limits \beta \langle \nabla \hat{V}_{h}^{\pi_k}, \pi - \pi_h^k \rangle - D_{KL} (\pi \,\|\, \pi_h^k) \,,
\end{align*}
where $\beta$ is the learning rate. In the non-robust case, this improvement step has also been shown to be theoretically efficient \citep{shani2020optimistic,wu2022nearly}. Many empirically successful policy optimization algorithms, such as PPO \citep{schulman2017proximal} and TRPO \citep{schulman2015trust}, also take a similar approach of KL regularization for non-robust policy improvement.
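
To see why this step has a closed form, the following brief sketch assumes that, at a fixed state $s$, the gradient term is represented by the optimistic values $\hat{Q}_h^k(s,\cdot)$ (the multiplier $\lambda$ is introduced here only for illustration). Maximizing $\beta \langle \hat{Q}_h^k(s,\cdot), \pi \rangle - D_{KL}(\pi \,\|\, \pi_h^k(\cdot \mid s))$ over $\Delta(\gA)$, with a Lagrange multiplier $\lambda$ for the constraint $\sum_{a} \pi(a) = 1$, gives the stationarity condition
\begin{align*}
    \beta \hat{Q}_h^k(s,a) - \log \frac{\pi(a)}{\pi_h^k(a \mid s)} - 1 - \lambda = 0
    \quad \Longrightarrow \quad
    \pi(a) \propto \pi_h^k(a \mid s) \exp\left(\beta \hat{Q}_h^k(s,a)\right) \,.
\end{align*}
For instance, with two actions, $\pi_h^k(\cdot \mid s) = (1/2, 1/2)$, $\hat{Q}_h^k(s,\cdot) = (1, 0)$, and $\beta = \log 2$, the unnormalized weights are $(1, 1/2)$ and the updated policy is $(2/3, 1/3)$.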

%A more detailed description of the algorithm is deferred to the Appendix.


\begin{algorithm}[h]
    \caption{Robust Optimistic Policy Optimization (ROPO)} 
    \label{alg}
    \begin{algorithmic}
    \STATE Input: learning rate $\beta$, bonus function $b_h^k$.
        \FOR{$k = 1, \ldots, K$}
        \STATE Collect a trajectory of samples by executing $\pi_k$.
        \STATE{ {\color{gray}\# Robust Policy Evaluation }}
        \FOR{$h = H, \ldots, 1$}
        \FOR{ $\forall (s,a) \in \gS \times \gA$}
        \STATE Solve $\sigma_{\hat{\gP}_h}(\hat{V}_{h+1}^\pi)(s,a)$ according to Equation (\ref{eq:inner_sa}) for $(s,a)$-rectangular set \\ or Equation (\ref{eq:inner_s}) for $s$-rectangular set.
        \STATE $\hat{Q}^{k}_h (s,a) = \min\left\{\hat{r}(s,a) + \sigma_{\hat{\gP}_h}(\hat{V}_{h+1}^\pi)(s,a) + b_h^k(s,a), H\right\} $.
        \ENDFOR
        \FOR{ $\forall s \in \gS$}
        \STATE $\hat{V}_h^k(s) = \left\langle \hat{Q}_h^k(s, \cdot), \pi_h^k(\cdot \mid s) \right\rangle$.
        \ENDFOR
        \ENDFOR
        \STATE{ {\color{gray} \# Policy Improvement}}
        \FOR{$\forall (h, s, a) \in [H] \times \gS \times \gA$}
        \STATE $\pi_h^{k+1}(a \mid s) = \frac{\pi_h^{k}(a \mid s)\exp(\beta \hat{Q}^{k}_h (s,a))}{\sum_{a^\prime} \pi_h^{k}(a^\prime \mid s) \exp(\beta \hat{Q}^{k}_h (s,a^\prime))} $.
        \ENDFOR
        \STATE Update empirical estimate $\hat{r}$, $\hat{P}$ with Equation (\ref{eq:empirical}).
        \ENDFOR
    \end{algorithmic}
\end{algorithm}


\section{Main results}

We are now ready to present the theoretical guarantees of our algorithm under the two uncertainty sets.

%\subsection{Results under $(s,a)$-rectangular uncertainty set} Equipped with Algorithm \ref{alg} and the bonus function described in Equation \ref{bonus_sa}. We obtain the regret upper bound under $(s,a)$-rectangular uncertainty set described in the following Theorem.

\begin{restatable*}[Regret under $(s,a)$-rectangular uncertainty set]{thm}{sa}
\label{thm:sa}
With learning rate $\beta = \sqrt{\frac{2 \log A}{H^2 K}}$ and bonus term $b_h^k$ as in Equation (\ref{bonus_sa}), with probability at least $ 1 - \delta$, the regret incurred by Algorithm \ref{alg} over $K$ episodes is bounded by $O \left( H^2  S \sqrt{AK\log \left( SAH^2 K^{3/2} ( 1 + \rho) / \delta \right)}\right) $.
\end{restatable*}

\begin{rem}
When $\rho = 0$, the problem reduces to non-robust reinforcement learning. In this case, our regret upper bound is $\tilde{O}\left( H^2 S \sqrt{AK} \right)$, which is of the same order as that of policy optimization algorithms for the non-robust case \citep{shani2020optimistic}. 
\end{rem}

Beyond the $(s,a)$-rectangular uncertainty set, we also extend our results to the $s$-rectangular uncertainty set (Definition \ref{def:s}). 
\begin{restatable*}[Regret under $s$-rectangular uncertainty set]{thm}{s}
\label{thm:s}
With learning rate $\beta = \sqrt{\frac{2 \log A}{H^2 K}}$ and bonus term $b_h^k$ as in Equation (\ref{bonus_s}), with probability at least $ 1 - \delta$, the regret of Algorithm \ref{alg} is bounded by $O \left( SA^2 H^2\sqrt{K\log(SA^2H^2K^{3/2}(1+\rho) / \delta)}\right)$.
\end{restatable*}
%\input{s_rect}
\begin{rem}
When $\rho = 0$, the problem reduces to non-robust reinforcement learning. In this case, our regret upper bound is $\tilde{O}\left( SA^2 H^2 \sqrt{K} \right)$. Our result is the first theoretical result for learning a robust policy under the $s$-rectangular uncertainty set, as previous results only learn the robust value function \citep{yang2021towards}. 
\end{rem}
We defer the proofs of these theorems, along with the experimental results for the proposed algorithm, to the Appendix. 

\section{Conclusion}
In this paper, we studied the problem of regret minimization in robust MDP with a rectangular uncertainty set. We proposed a robust variant of optimistic policy optimization, which achieves sublinear regret in all uncertainty sets considered. Our algorithm delicately balances the exploration-exploitation trade-off through a carefully designed bonus term, which quantifies not only the uncertainty due to the limited observations but also the uncertainty of robust MDPs. Our results are the first regret upper bounds in robust MDPs as well as the first non-asymptotic results in robust MDPs without access to a generative model. 


\section*{Acknowledgement}
Jing Dong and Baoxiang Wang are partially supported by National Natural Science Foundation of China (62106213, 72150002) and Shenzhen Science and Technology Program (RCBS20210609104356063, JCYJ20210324120011032). Jingzhao Zhang is supported by Tsinghua University Initiative Scientific Research Program.

\bibliography{ICLR/iclr2023_conference}

\newpage
\appendix

\section{Importance of robustness }
With the robust MDP, one of the most naive methods is to directly train a policy with the nominal transition model. However, the following claim shows that an optimal policy under the nominal model can be arbitrarily bad under the worst-case transition (even worse than a random policy).  
\begin{restatable}[Suboptimality of non-robust optimal policy]{claim}{hard}\label{prop:hard}
   There exists a robust MDP $\gM = \langle \gS, \gA, \gP, r, H \rangle$ with uncertainty set $\gP$ of uncertainty radius $\rho$, such that the non-robust optimal policy is $\Omega(1)$-suboptimal compared to the uniformly random policy. 
 \end{restatable}
The proof of Claim \ref{prop:hard} is deferred to Appendix \ref{appendix:prop}. This result implies that the policy obtained with non-robust RL algorithms can have arbitrarily bad performance when the dynamics deviate from the nominal transition. This motivates our robust optimistic policy optimization algorithm (Algorithm \ref{alg}), which avoids this undesired outcome.   


\section{Related works} 

\input{related}
\newpage
\section{Experiments}
\input{exp}
\newpage
\input{appendix}


\end{document}