% !TEX root = main21neurips-ssp.tex

\section{Experiments}
\label{sec: experiments}

The literature on regret minimization for the SSP model mostly lacks numerical evaluation, with the exception of \cite{tarbouriech2020no}. Standard OpenAI Gym environments \cite{brockman2016openai} are either not designed for the SSP setting (e.g., FrozenLake-v0, CartPole-v1, and MuJoCo) or are more suitable for algorithms with function approximation (e.g., Atari and Box2D). In this section, we design three benchmark environments, RandomMDP, GridWorld, and SSP-MountainCar, and compare the performance of our \ssp~algorithm with existing OFU-type algorithms in the literature.

\textbf{Description of Environments.} RandomMDP \citep{ouyang2017learning,wei2020model} is an SSP with 8 states and 2 actions whose transition kernel and cost function are generated uniformly at random ($\cmin = 0.04$). 

GridWorld \citep{tarbouriech2020no} is a $3\times 4$ grid (12 states in total, including the goal state) with 4 actions (LEFT, RIGHT, UP, DOWN) and $c(s, a) = 1$ for every state-action pair $(s, a) \in \calS \times \calA$. The agent starts from the initial state at the top left corner of the grid, and the goal state is at the bottom right corner. At each time step, the agent attempts to move in one of the four directions. The attempt succeeds with probability 0.85; with probability 0.15, the agent instead moves in one of the undesired directions, chosen uniformly at random. If the agent tries to move out of the boundary, the attempt fails and the agent remains in the same position.
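
As a worked reading of this slip model, assume that the ``undesired directions'' are the three directions other than the intended one (this is an interpretation, and the symbols $d$ and $d'$ are introduced here only for illustration). If the agent attempts direction $d$, then
\[
	\Pr(\mbox{move in } d) = 0.85, \qquad \Pr(\mbox{move in } d') = \frac{0.15}{3} = 0.05 \quad \mbox{for each } d' \neq d,
\]
so that $0.85 + 3 \times 0.05 = 1$. Under the same assumption, whenever the realized direction points out of the grid, the corresponding probability mass stays on the current cell.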

The SSP-MountainCar environment is a modification of the standard MountainCar-v0 environment \citep{moore1990efficient}. It simulates a car that is positioned between two mountains and wants to drive up the mountain on the right. However, the engine is not powerful enough to ascend directly, so the car needs to drive back and forth to build adequate momentum. This is a continuous state space SSP model with three actions (LEFT, RIGHT, NEUTRAL). The state is the pair (position, velocity), where position takes values in $[-1.2, 0.6]$ and velocity takes values in $[-0.07, 0.07]$. The agent suffers a cost of $1$ at each time step before reaching the goal. We discretize the state space with a step size of $0.1$ for the position and $0.02$ for the velocity (a total of $126 = 18 \times 7$ states). This discretization applies only to the agent's observations; the underlying dynamics are unchanged. Although the underlying environment is deterministic, the agent observes stochastic transitions due to the discretization. The standard MountainCar-v0 environment artificially terminates the interaction between the agent and the environment after 200 steps, which makes it much simpler than SSP-MountainCar, where the interaction terminates only when the goal is reached. Indeed, standard RL algorithms \cite{sutton2018reinforcement} (such as Q-learning and SARSA) that work well in MountainCar-v0 cannot reach the goal even in the first episode of the SSP-MountainCar environment.
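
The bin counts follow from the ranges and step sizes: $(0.6 - (-1.2))/0.1 = 18$ position bins and $(0.07 - (-0.07))/0.02 = 7$ velocity bins, giving $18 \times 7 = 126$ discrete states. One possible indexing, assuming equal-width bins anchored at the lower end of each range (the symbols $p$, $v$, $i$, $j$ are introduced here for illustration, and the exact binning used in the experiments may differ), maps a continuous state $(p, v)$ to
\[
	i = \min\left\{ \left\lfloor \frac{p + 1.2}{0.1} \right\rfloor, 17 \right\}, \qquad
	j = \min\left\{ \left\lfloor \frac{v + 0.07}{0.02} \right\rfloor, 6 \right\},
\]
with discrete state index $7i + j \in \{0, \ldots, 125\}$.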

In the experiments, we evaluate the frequentist regret of \ssp~for a fixed environment (i.e., the environment is not sampled from a prior distribution). %Thus, the \ssp~algorithm can be viewed as a completely parameter-free algorithm in the numerical results. 
We use a Dirichlet prior with parameters $[0.1, \ldots, 0.1]$ for the transition kernel. These parameters remain the same across environments and are not tuned as hyper-parameters. The Dirichlet distribution is a common prior in Bayesian statistics since it is a conjugate prior for the categorical and multinomial distributions.
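
For concreteness, conjugacy gives a closed-form posterior. In the sketch below, $m$ denotes the number of possible next states and $N(s, a, s')$ the number of observed transitions from $(s, a)$ to $s'$; both symbols are introduced here for illustration and may differ from the notation used elsewhere in the paper. Starting from the prior $\mathrm{Dirichlet}(0.1, \ldots, 0.1)$ on the next-state distribution of a pair $(s, a)$, the posterior after observing these counts is
\[
	\mathrm{Dirichlet}\bigl(0.1 + N(s, a, s'_1), \ldots, 0.1 + N(s, a, s'_m)\bigr),
\]
whose mean assigns probability
\[
	\frac{0.1 + N(s, a, s')}{0.1\, m + \sum_{s''} N(s, a, s'')}
\]
to next state $s'$. Because the prior parameters are small, a few observed transitions quickly dominate the prior.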

We compare the performance of our proposed \ssp~against all existing online learning algorithms with provable guarantees for the SSP problem (\texttt{UC-SSP} \citep{tarbouriech2020no}, \texttt{Bernstein-SSP} \citep{rosenberg2020near}, \texttt{ULCVI} \citep{cohen2021minimax}, and \texttt{EB-SSP} \citep{tarbouriech2021stochastic}). The results are averaged over 10 independent runs, and 95\% confidence intervals are used to compare the performance of the algorithms. All the experiments are performed on a 2015 MacBook Pro with a 2.7 GHz Dual-Core Intel Core i5 processor and 16\,GB of RAM.
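
As an illustration of how such intervals can be computed (the construction below is an assumption, not a description of the implementation used for the plots), a common choice is the Student-$t$ interval over $n = 10$ runs. With $\bar{R}$ the sample mean and $\hat{\sigma}$ the sample standard deviation of the cumulative regret across runs (symbols introduced here for illustration), this reads
\[
	\bar{R} \pm t_{0.975,\, 9} \, \frac{\hat{\sigma}}{\sqrt{10}}, \qquad t_{0.975,\, 9} \approx 2.262,
\]
whereas the normal approximation would replace $2.262$ by $1.96$.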


\begin{figure}[t]
	\centering
	\begin{tabular}{cc}
		\includegraphics[width=0.22\textwidth]{Figures/randommdp.pdf} &
		\includegraphics[width=0.22\textwidth]{Figures/randommdptuned.pdf} \\
		\includegraphics[width=0.22\textwidth]{Figures/gridworld.pdf} &
		\includegraphics[width=0.22\textwidth]{Figures/gridworldtuned.pdf}
	\end{tabular}
	\caption{
		Cumulative regret of existing SSP algorithms on RandomMDP (top) and GridWorld (bottom) for $10{,}000$ episodes. The results are averaged over 10 runs, and the shaded area shows the 95\% confidence interval. Our proposed \ssp~algorithm considerably outperforms all the existing algorithms when the confidence intervals of the other algorithms are not tuned (left plots). \ssp~(with no hyper-parameter tuning) performs similarly to the OFU algorithms when their confidence intervals are tuned as a hyper-parameter (right plots).}
	\label{fig: plot}
\end{figure}
\begin{figure}[t]
	\centering
	\begin{tabular}{cc}
		\includegraphics[width=0.22\textwidth]{Figures/mountaincarfigure.png} &
		\includegraphics[width=0.22\textwidth]{Figures/sspmountaincar.pdf}
	\end{tabular}
	\caption{
		(left) SSP-MountainCar environment. (right) Average cost per episode of the \ssp~algorithm. OFU algorithms did not learn in a reasonable amount of time (and are thus not included) due to the large state space.}
	\label{fig: plot mountaincar}
\end{figure}


We compare \ssp~with OFU algorithms in two scenarios. In the first scenario, the theoretical confidence intervals are used for the OFU algorithms (Figure~\ref{fig: plot} (left)). In the second scenario, the confidence intervals of the OFU algorithms are multiplied by a coefficient smaller than 1 to expedite learning (Figure~\ref{fig: plot} (right)). This coefficient is tuned as a hyper-parameter. Figure~\ref{fig: plot} (left) shows that \ssp~significantly outperforms all the previously proposed algorithms for the SSP problem when the theoretical confidence intervals are used. In particular, it outperforms the recently proposed \texttt{ULCVI} \citep{cohen2021minimax} and \texttt{EB-SSP} \citep{tarbouriech2021stochastic}, which match the theoretical lower bound. In our numerical evaluation, the \texttt{ULCVI} algorithm does not show any evidence of learning even after 80,000 episodes (not shown here). Figure~\ref{fig: plot} (right) shows that the performance of \ssp~(with no hyper-parameter tuning) is similar to that of the tuned OFU algorithms (where the confidence interval is tuned as a hyper-parameter). %This suggests that the theoretical constants chosen for this algorithm are too conservative.
The poor performance of the untuned OFU algorithms underscores the need to consider PS algorithms in practice.

The gap between the performance of \ssp~and the OFU algorithms is even more apparent in the GridWorld environment, which is more challenging than RandomMDP. In RandomMDP, the goal state can be reached from any state in a single step because the transition kernel is generated uniformly at random. In the GridWorld environment, however, the agent has to take a sequence of actions to the right and down to reach the goal at the bottom right corner. Figure~\ref{fig: plot} (bottom) shows that \ssp~learns this pattern significantly faster than the OFU algorithms.

Figure~\ref{fig: plot mountaincar} evaluates the performance of \ssp~in the SSP-MountainCar environment, which has a much larger state space. The large state space of this environment prevents the OFU algorithms from learning in a reasonable amount of time (and thus they are not shown in the figure). In contrast, \ssp~improves quickly after a few episodes.

%Since these plots are generated for a fixed environment (not generated from a prior), we conjecture that \ssp~enjoys the same regret bound under the non-Bayesian setting.

These results support the intuition that OFU-type algorithms are too conservative in uncertainty estimation, whereas PS-type algorithms are statistically more efficient and hence perform better empirically across almost all of the settings considered here.

%\textcolor{red}{We note that Optimism-based algorithms resort to a worst-case mis-estimation that leads to too conservative confidence sets, while PSRL-SSP selects policies according to the probability that they are optimal and the uncertainty is quantified in a statistically efficient way through the posterior distribution. We believe that the extra $\sqrt{S}$ in the regret bound is an artifact of the analysis and removing it is an important open question even for the finite-horizon case.}