
To validate our theoretical findings, we conduct a preliminary empirical analysis of our proposed robust policy optimization algorithm. We will make our implementation publicly available.
\begin{wrapfigure}[11]{l}{0.2\textwidth}
\centering
\includegraphics[width=0.2\textwidth]{grid.png}
\caption{Example of the Gridworld environment.}\label{fig:grid}
\end{wrapfigure}
\paragraph{Environment}

We conduct the experiments in the Gridworld environment, a classic early example in reinforcement learning \cite{sutton2018reinforcement}. The environment is a two-dimensional $5 \times 5$ grid of cells, and the agent starts from the upper-left cell. Each cell is one of three types: road (labeled with $\circ$), wall (labeled with $\times$), or reward state (labeled with $+$).

The agent can walk through road cells but not wall cells; if it attempts to move into a wall cell, it stays in place. When the agent steps on the reward cell, it receives a reward of 1; otherwise it receives no reward. The goal of the agent is to collect as much reward as possible within the allowed time.
At each step, the agent chooses one of four actions: up, down, left, or right. With success probability $p$, the agent moves in the chosen direction; with the remaining probability $1-p$, it moves in one of the other directions, chosen uniformly at random.
Figure \ref{fig:grid} shows an example of our environment.
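
As a sketch of these dynamics in symbols (the notation $\bar{P}$ and $n(s,d)$ is introduced here for illustration only), let $n(s,d)$ denote the cell reached from $s$ by moving in direction $d$, with $n(s,d) = s$ when the move is blocked by a wall (or, presumably, by the grid boundary). Reading the uniform split as an equal share of the remaining mass for each of the three other directions, the nominal kernel is
\[
\bar{P}(s' \mid s, a) \;=\; p \, \mathbf{1}\{n(s,a) = s'\} \;+\; \frac{1-p}{3} \sum_{d \neq a} \mathbf{1}\{n(s,d) = s'\}.
\]
For instance, $p = 0.9$ gives each of the three unintended directions probability $0.1/3 \approx 0.033$.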

%\vspace{-2.5cm}
\paragraph{Experiment configurations}
To simulate the robust MDP, we create nominal transition dynamics with move success probability $p = 0.9$. The learning agent interacts with the nominal dynamics during training and with perturbed dynamics during evaluation. Under the $(s,a)$-rectangular set, the transitions are perturbed against the direction in which the agent is moving, subject to a budget of $\rho$. Under the $s$-rectangular set, the transitions are perturbed against the direction of the goal state. For example, if the agent chooses to go down to reach the goal state, the perturbation acts against the agent's direction (upward) by $\rho$. This adversarial change of transition implements the adversarial behavior described by the robust MDP objective $\min_{\{P_h\} \in \{\gP_h\}} V^{\pi, \{P_h\}}_h (s)$.
%[concrete example of how the kernel change under what exactly rho]
As a result, some policies that are optimal under the nominal transition become sub-optimal under the perturbed transitions.
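
To make the perturbation concrete, consider the following illustrative sketch, which is one possible reading of the construction and not necessarily the exact kernel used in the experiments. Suppose the adversary moves probability mass $\delta$ from the intended direction to the opposite direction and leaves the two lateral directions unchanged. The distribution over the (intended, opposite, lateral, lateral) directions then changes as
\[
\left(p,\ \frac{1-p}{3},\ \frac{1-p}{3},\ \frac{1-p}{3}\right)
\;\longmapsto\;
\left(p - \delta,\ \frac{1-p}{3} + \delta,\ \frac{1-p}{3},\ \frac{1-p}{3}\right),
\]
which differs from the nominal distribution by $2\delta$ in $\ell_1$ distance. With $p = 0.9$ and $\delta = 0.1$, the intended move succeeds with probability $0.8$ and the opposite move occurs with probability $0.1/3 + 0.1 \approx 0.133$. Whether a budget of $\rho$ corresponds to $\delta = \rho$ or $\delta = \rho/2$ depends on how the $\ell_1$ constraint is normalized.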
\vspace{-0.1cm}
\begin{figure*}[htb]
     \centering
     \begin{subfigure}[b]{0.25\textwidth}
         \centering
         \includegraphics[width=\textwidth]{POMD_gridworld_rho_01.png}
         \caption{$\rho = 0.1$}
         \label{fig:rho1}
     \end{subfigure}
     %\hfill
     \begin{subfigure}[b]{0.25\textwidth}
         \centering
         \includegraphics[width=\textwidth]{POMD_gridworld_rho_02.png}
          \caption{$\rho = 0.2$}
         \label{fig:rho2}
     \end{subfigure}
     %\hfill
     \begin{subfigure}[b]{0.25\textwidth}
         \centering
         \includegraphics[width=\textwidth]{POMD_gridworld_rho_03.png}
          \caption{$\rho = 0.3$}
         \label{fig:rho3}
     \end{subfigure}
        \caption{Cumulative rewards obtained by robust and non-robust policy optimization under robust transitions with uncertainty levels $\rho = 0.1, 0.2, 0.3$, using the $\ell_1$ distance and the $(s,a)$-rectangular set.}
        \label{fig:exp}
\end{figure*}
%\vspace{-0.8cm}
\begin{figure*}[htb]
     \centering
     \begin{subfigure}[b]{0.25\textwidth}
         \centering
         \includegraphics[width=\textwidth]{POMD_gridworld_s_rect0.1.png}
         \caption{$\rho = 0.1$}
         \label{fig:s_rho1}
     \end{subfigure}
     %\hfill
     \begin{subfigure}[b]{0.25\textwidth}
         \centering
         \includegraphics[width=\textwidth]{POMD_gridworld_s_rect0.2.png}
          \caption{$\rho = 0.2$}
         \label{fig:s_rho2}
     \end{subfigure}
     %\hfill
     \begin{subfigure}[b]{0.25\textwidth}
         \centering
         \includegraphics[width=\textwidth]{POMD_gridworld_s_rect0.3.png}
          \caption{$\rho = 0.3$}
         \label{fig:s_rho3}
     \end{subfigure}
        \caption{Cumulative rewards obtained by robust and non-robust policy optimization under robust transitions with uncertainty levels $\rho = 0.1, 0.2, 0.3$, using the $\ell_1$ distance and the $s$-rectangular set.}
        \label{fig:exp_2}
\end{figure*}

\paragraph{Results}
We refer to the perturbed transitions as robust transitions in our results.
%\dong{move the configuration to appendix?}
%\paragraph{Algorithm configuration}
We implement our proposed robust policy optimization algorithm along with its non-robust variant \cite{shani2020optimistic}. The inner minimization in Algorithm \ref{alg} is computed through its dual formulation for efficiency. Our implementation uses the RLberry framework \citep{rlberry}.

We present results with $\rho = 0.1, 0.2, 0.3$ under the $(s,a)$-rectangular set in Figure \ref{fig:exp} and under the $s$-rectangular set in Figure \ref{fig:exp_2}. The figures report the averaged cumulative rewards during evaluation. Across all levels of uncertainty and both uncertainty sets, the robust variant of the policy optimization algorithm obtains higher rewards than its non-robust variant, indicating that it is more robust to changes in the dynamics.


%\begin{table}[h]\centering
%\caption{Averaged cumulative rewards of the last $20$ episodes with POMD and Robust POMD under robust and nominal transitions. }\label{table:exp}\begin{tabular}{|c|c|c|c|}
%\hline
%\multicolumn{1}{|l|}{} & \begin{tabular}[c]{@{}c@{}}Robust POMD with \\ robust transition\end{tabular} & \begin{tabular}[c]{@{}c@{}}POMD with \\ robust transition\end{tabular} & \begin{tabular}[c]{@{}c@{}}POMD with \\ nominal transition\end{tabular} \\ \hline
%$\rho = 0.1$           & \textbf{8.89}                                                                          & 7.23                                                                   &  10.01                                               \\ 
%$\rho = 0.2$           & \textbf{6.16}                                                                          & 5.24                                                                   &     10.01                                                                        \\ 
%$\rho = 0.3$           &   \textbf{3.50}                                                                            &   2.81                                                                     &       10.01                                                                      \\ \hline
%\end{tabular}
%\end{table}


%\vspace{-0.5cm}


