Before introducing our algorithm, we illustrate why uncertainty must be taken into account. In a robust MDP, one of the most naive approaches is to train a policy directly on the nominal transition model. However, the following proposition shows that a policy that is optimal under the nominal model can be arbitrarily bad under the worst-case transition (even worse than a random policy).

\begin{restatable}[Suboptimality of non-robust optimal policy]{claim}{hard}\label{prop:hard}
   There exists a robust MDP $\gM = \langle \gS, \gA, \gP, r, H \rangle$ with uncertainty set $\gP$ of uncertainty radius $\rho$, such that the non-robust optimal policy is $\Omega(1)$-suboptimal to the uniformly random policy. 
 \end{restatable}
The proof of Proposition \ref{prop:hard} is deferred to Appendix \ref{appendix:prop}. 
This result motivates us to propose an algorithm that performs well even when the models mismatch. To this end, we present below robust optimistic policy optimization (Algorithm \ref{alg}), which enjoys sublinear regret and desirable practical performance.

%  \begin{minipage}{0.7\linewidth}

\begin{algorithm}[htb]
    \caption{Robust Optimistic Policy Optimization (ROPO)} 
    \label{alg}
    \begin{algorithmic}
    \STATE Input: learning rate $\beta$, bonus function $b_h^k$.
        \FOR{$k = 1, \ldots, K$}
        \STATE Collect a trajectory of samples by executing $\pi_k$.
        %\STATE{ {\color{gray}\# Robust Policy Evaluation }}
        \FOR{$h = H, \ldots, 1$}
        \FOR{ $\forall (s,a) \in \gS \times \gA$}
        \STATE Solve $\sigma_{\hat{\gP}_h}(\hat{V}_{h+1}^k)(s,a)$, according to Equation (\ref{eq:inner_sa}) for $(s,a)$-rectangular set or Equation (\ref{eq:inner_s}) for $s$-rectangular set.
        \STATE $\hat{Q}^{k}_h (s,a) = \min\left\{\hat{r}(s,a) + \sigma_{\hat{\gP}_h}(\hat{V}_{h+1}^k)(s,a)\right. $ $ \left.+ b_h^k(s,a), H\right\}$.
        %\STATE $\hat{Q}^{k}_h (s,a) = \min \left\{\hat{Q}^{k}_h (s,a), H \right\}, \forall (s,a) \in \gS \times \gA$.
        \ENDFOR
        \FOR{ $\forall s \in \gS$}
        \STATE $\hat{V}_h^k(s) = \left\langle \hat{Q}_h^k(s, \cdot), \pi_h^k(\cdot \mid s) \right\rangle$.
        \ENDFOR
        \ENDFOR
        %\STATE{ {\color{gray} \# Policy Improvement}}
        %\FOR{$\forall h, s, a \in [H] \times \gS \times \gA$}
        \STATE $\pi_h^{k+1}(a \mid s) = \frac{\pi_h^{k}(a\mid s)\exp(\beta \hat{Q}^{k}_h (s,a))}{\sum_{a^\prime}  \pi_h^{k}(a^\prime\mid s)\exp(\beta \hat{Q}^{k}_h (s,a^\prime))} $, $\forall (h, s, a) \in [H] \times \gS \times \gA$.
        %\ENDFOR
        \STATE Update the empirical estimates $\hat{r}$ and $\hat{P}$ with Equation (\ref{eq:empirical}).
        \ENDFOR
    \end{algorithmic}
\end{algorithm}
%{\color{blue} edit this alg}
%   \end{minipage}%

\subsection{Robust optimistic policy optimization}
%[repeat policy/value argument and then get motivated]
In the presence of an uncertainty set, all optimal policies may be randomized \citep{wiesemann2013robust}. In such cases, value-based methods may be insufficient, as they usually rely on a deterministic policy. We therefore resort to optimistic policy optimization methods~\citep{shani2020optimistic}, which directly learn a stochastic policy.

Our algorithm performs policy optimization with empirical estimates and encourages exploration by adding a bonus to less explored states. This requires a new, efficiently computable bonus that is robust to adversarial transitions, which we obtain by solving a sub-optimization problem derived from the Fenchel conjugate. We present Robust Optimistic Policy Optimization (ROPO) in Algorithm \ref{alg} and elaborate on its design components below.

\paragraph{The empirical model}
Since our algorithm has no access to the actual reward and transition functions, we use the following empirical estimators of the reward and the transition:
\begin{align}\label{eq:empirical}
    \hat{r}_h^k(s,a) = \ & \frac{\sum^{k - 1}_{k^\prime = 1} R_h^{k^\prime}(s,a)\mathbb{I}_{s_h^{k^\prime},a_h^{k^\prime}}^{s,a}}{N_h^k(s,a)} \,, \nonumber\\
    \hat{P}_h^{o,k}(s,a,s^\prime) = \ & \frac{\sum^{k - 1}_{k^\prime = 1} \mathbb{I}_{s_h^{k^\prime},a_h^{k^\prime},s_{h+1}^{k^\prime}}^{s,a, s^\prime}}{N_h^k(s,a)} \,,
\end{align}
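where
\begin{align*}
    \mathbb{I}_{s_h^{k^\prime},a_h^{k^\prime}}^{s,a} = \ & \mathbb{I} \left\{\left(s_h^{k^\prime},a_h^{k^\prime}\right) = (s,a) \right\} \,, \\
    N_h^k(s,a) = \ & \max \left\{ \sum^{k-1}_{k^\prime = 1}  \mathbb{I}_{s_h^{k^\prime},a_h^{k^\prime}}^{s,a},1\right\} \,.
\end{align*}
The indicator $\mathbb{I}_{s_h^{k^\prime},a_h^{k^\prime},s_{h+1}^{k^\prime}}^{s,a, s^\prime}$ is defined analogously for the triple $(s,a,s^\prime)$, and $N_h^k(s,a)$ counts the number of visits to $(s,a)$ at step $h$ before episode $k$, truncated below at $1$ so that the estimators are always well defined.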
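As a small illustration with hypothetical numbers (not taken from any experiment), suppose that at step $h$ the pair $(s,a)$ was visited in three of the first $k-1$ episodes, with observed rewards $1, 0, 1$ and observed next states $s^\prime, s^\prime, s^{\prime\prime}$. Equation (\ref{eq:empirical}) then gives
\begin{align*}
    N_h^k(s,a) = 3 \,, \qquad \hat{r}_h^k(s,a) = \tfrac{2}{3} \,, \qquad \hat{P}_h^{o,k}(s,a,s^\prime) = \tfrac{2}{3} \,, \qquad \hat{P}_h^{o,k}(s,a,s^{\prime\prime}) = \tfrac{1}{3} \,.
\end{align*}
For a pair that has never been visited, the sum of indicators is $0$, so the truncation yields $N_h^k(s,a) = 1$ and both estimates equal $0$ rather than being undefined.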

\paragraph{Challenge: Optimistic robust policy evaluation}
As in standard optimistic algorithms, Algorithm \ref{alg} estimates $Q$-values with an optimistic variant of the Bellman equation to encourage exploration in the robust MDP. The bonus term $b_h^k(s,a)$ compensates for the lack of knowledge of the actual reward and transition model as well as of the uncertainty set, and is of order $b_h^k(s,a) = O( N_h^k(s,a)^{-1/2})$.
% \begin{align}
%     \hat{Q}^{k}_h (s,a) = \ & \min\big\{\hat{r}(s,a) + \sigma_{\hat{\gP}_h}(\hat{V}_{h+1}^\pi)(s,a)  + b_h^k(s,a), H\big\} \,, \quad 
%     \hat{V}_h^k(s) = \ & \left\langle \hat{Q}_h^k(s, \cdot), \pi_h^k(\cdot \mid s) \right\rangle\,. \nonumber
% \end{align}

However, in the robust MDP setting, analyzing the bonus term can be tricky. Intuitively, the bonus term $b_h^k$ must capture the optimism required for efficient exploration with respect to both the estimation error of $P$ and the robustness with respect to $P$.
It is hard to control the two quantities in their primal (original) form, because it is unclear how the error in estimating $P$ affects the solution of the estimated inner minimization $\sigma_{\hat{\gP}_h}$.

We propose the following procedure to address the problem.
%While the above algorithm is effective, the computations involved may be costly. At each update step, one is required to construct an empirical estimate of the uncertainty set and then solve an inner minimization problem $\sigma_{\hat{\gP}_h}(\hat{V}_{h+1}^\pi)(s)$. The inner problem is a high-dimensional optimization problem. We now show that the inner minimization problem can be reduced to a lower dimensional optimization problem, which enjoys computation efficiency. 
The key difference between our algorithm and standard policy optimization is that computing $\sigma_{\hat{\gP}_h}(\hat{V}_{h+1}^\pi)(s)$ requires solving the inner minimization (\ref{eq:sigma}).
By relaxing the constraints with Lagrange multipliers and Fenchel conjugates, under an $(s,a)$-rectangular set the inner minimization problem reduces to a one-dimensional unconstrained convex optimization problem over $\mathbb{R}$ (Lemma \ref{lem:sa_con}):
\begin{align}\label{eq:inner_sa}
    \sup_{\eta} & \ \eta - \frac{ (\eta - \min\limits_{s} \hVpn(s))_{+}}{2}\rho  \nonumber\\
    & - \sum_{s^\prime}\hat{P}_h^o(s^\prime \mid s,a) \left( \eta - \hVpn(s^\prime)\right)_{+} \,.
\end{align}
The optimum of Equation (\ref{eq:inner_sa}) can be computed efficiently with bisection or sub-gradient methods. More importantly, this form allows us to quantify how the error in estimating the transition kernel affects the estimated value function while bypassing $\sigma_{\hat{\gP}_h}$.
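The following sketch, which is an illustration rather than part of the formal analysis, indicates why bisection applies. Write $V = \hVpn$ and let $g(\eta)$ denote the objective of Equation (\ref{eq:inner_sa}). Since $g$ is concave and piecewise linear in $\eta$, one of its supergradients at $\eta$ is
\begin{align*}
    1 - \frac{\rho}{2}\, \mathbb{I}\Big\{\eta > \min_{s} V(s)\Big\} - \sum_{s^\prime}\hat{P}_h^o(s^\prime \mid s,a)\, \mathbb{I}\big\{\eta > V(s^\prime)\big\} \,,
\end{align*}
which is non-increasing in $\eta$. Assuming $\rho > 0$ and that $\hat{P}_h^o(\cdot \mid s,a)$ sums to one, this quantity equals $1$ for $\eta \le \min_s V(s)$ and $-\rho/2$ for $\eta > \max_s V(s)$, so a maximizer can be found by bisecting on its sign over $[\min_s V(s), \max_s V(s)]$.
As a sanity check with hypothetical numbers, take two next states with $V = (0, 1)$, $\hat{P}_h^o(\cdot \mid s,a) = (1/2, 1/2)$, and $\rho = 1/2$. Then
\begin{align*}
    g(\eta) =
    \begin{cases}
        \eta \,, & \eta \le 0 \,, \\
        \eta/4 \,, & 0 \le \eta \le 1 \,, \\
        1/2 - \eta/4 \,, & \eta \ge 1 \,,
    \end{cases}
\end{align*}
so the supremum is attained at $\eta = 1$ with value $1/4$. If the uncertainty set is taken to be the $\ell_1$ ball of radius $\rho$ around the nominal kernel (an assumption made only for this check), the primal problem gives the same value: the adversary moves $\rho/2 = 1/4$ of the probability mass from the high-value state to the low-value state, yielding $\tfrac{3}{4}\cdot 0 + \tfrac{1}{4}\cdot 1 = \tfrac{1}{4}$.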
%We note that while the dual form has been similarly used before under the presence of a generative model or with an offline dataset \citep{badrinath2021robust,panaganti2022sample,yang2021towards}, it remains unclear whether it is effective for the online setting. 


Similarly, in the case of an $s$-rectangular set, the inner minimization problem is equivalent to an $A$-dimensional convex optimization problem:
\begin{align}\label{eq:inner_s}
    \sup_{\eta} \ &\sum_{a^\prime} \eta_{a^\prime} -  \sum_{s^\prime, a^\prime} \hat{P}_h^o(s^\prime \mid s,a^\prime) \left(\eta_{a^\prime} - \mathbb{I}\{a^\prime = a\} \hat{V}_{h+1}^{\pi_k}(s^\prime) \right)_{+} \nonumber\\
    & \ - \min_{s^\prime, a^\prime}\frac{A \rho  (\eta_{a^\prime} - \mathbb{I}\{a^\prime = a\} \hat{V}_{h+1}^{\pi_k}(s^\prime))_{+}}{2} \,,
\end{align}%\vspace{-0.1cm}
where $a \sim \pi_k(s)$.
%This optimum in $\mathbb{R}^A$ can be computed efficiently in $\tilde{O}(A)$ iterations by methods like gradient descent.

%With equation \ref{eq:inner_s} and gradient descent methods, the solution of the inner problem can be computed efficiently with a computation complexity of $O(A)$.

In addition to reducing computational complexity, the dual forms (Equations (\ref{eq:inner_sa}) and (\ref{eq:inner_s})) decouple the uncertainty due to estimation error from the uncertainty due to robustness, as $\rho$ and $\hat{P}_h^o$ appear in different terms. The exact form of $b_h^k$ is presented in Equations (\ref{bonus_sa}) and (\ref{bonus_s}).

\paragraph{Policy improvement step}
Using the optimistic $Q$-value obtained from policy evaluation, the algorithm improves the policy with a KL-regularized online mirror descent step, $\pi_h^{k+1} \in \arg\max\limits_{\pi} \beta \langle \nabla \hat{V}_{h}^{\pi_k}, \pi - \pi_h^k \rangle - D_{KL} (\pi \,\|\, \pi_h^k)$, where $\beta$ is the learning rate.
Equivalently, the updated policy is given by the closed-form solution
$\pi_h^{k+1}(a \mid s) = \frac{\pi_h^{k}(a \mid s)\exp(\beta \hat{Q}^{k}_h (s,a))}{\sum_{a^\prime} \pi_h^{k}(a^\prime\mid s)\exp(\beta \hat{Q}^{k}_h (s,a^\prime))} $.
An important property of this policy improvement step is that it admits the fundamental inequality (\ref{eq:omd}) of online mirror descent presented in \citep{shani2020optimistic}. We suspect that other online algorithms with sublinear regret could also be used for policy improvement.
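For intuition, the closed form can be recovered with a standard Lagrangian argument, sketched here under the assumption that the step is solved separately for each $(h,s)$ over the probability simplex. Fix $(h,s)$ and use the shorthand $Q(a)$ for the optimistic $Q$-value at $(h,s,a)$ and $p(a)$ for $\pi_h^k(a \mid s)$ (both introduced only for this sketch). Maximizing $\beta \sum_a \pi(a) Q(a) - \sum_a \pi(a) \log \frac{\pi(a)}{p(a)}$ subject to $\sum_a \pi(a) = 1$, with multiplier $\lambda$, gives the stationarity condition
\begin{align*}
    \beta Q(a) - \log \frac{\pi(a)}{p(a)} - 1 - \lambda = 0
    \quad \Longrightarrow \quad
    \pi(a) = p(a) \exp\big(\beta Q(a)\big) \exp(-1-\lambda) \,.
\end{align*}
Choosing $\lambda$ so that $\pi$ sums to one yields the normalizing constant $\sum_{a^\prime} p(a^\prime)\exp(\beta Q(a^\prime))$, and non-negativity holds automatically because the exponential is positive. The update therefore reweights the current policy multiplicatively, placing more mass on actions with larger optimistic $Q$-values.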

In the non-robust case, this improvement step has also been shown to be theoretically efficient \citep{shani2020optimistic,wu2022nearly}. Many empirically successful policy optimization algorithms, such as PPO \citep{schulman2017proximal} and TRPO \citep{schulman2015trust}, also take a similar approach, using KL regularization for non-robust policy improvement. Putting everything together, the proposed algorithm is summarized in Algorithm \ref{alg}.



%\paragraph{Solving $\sigma_{\hat{\gP}_h}(\hat{V}_{h+1}^\pi)(s)$}


