In this section, we conduct simple numerical experiments to verify our theoretical results for both full-information and bandit feedback settings.

\subsection{Settings}
We run the simulation on a variant of ``RiverSwim'' \citep{strehl2008analysis,osband2013more}, a standard benchmark for tabular MDPs, illustrated in Fig.~\ref{fig: RiverSwim}.
Briefly speaking, the environment consists of six states and two actions ``left'' and ``right'', i.e., $\statesize = 6$ and $\actionsize = 2$.
Before each episode $\episode$ starts, the loss function $\loss_\episode$ is drawn as follows.
For any state and action ``left'', the loss is sampled from a normal distribution $\cN\rbr{\mu,\sigma}$ with $\mu=0.7,\sigma=0.25$, and for any state and action ``right'', the loss is sampled from a normal distribution $\cN\rbr{\mu,\sigma}$ with $\mu=0.3,\sigma=0.25$.
Each episode lasts $\horizontotal = 20$ steps, after which the environment is reset.
In each episode, the agent starts from the left side and executes its policy.
At each step of episode $\episode$, the action ``left'' always succeeds, whereas the action ``right'' may fail.
Note that independent and identically distributed (i.i.d.) losses are a special case of adversarial losses, so this environment is valid for our problem setting.
Moreover, this environment is an episodic adversarial MDP, which is a specific instance of our loop-free adversarial MDP model.
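Concretely, writing $s$ for a state and $a$ for an action (notation introduced here only for illustration), the loss construction described above can be sketched as
% Sketch only: assumes amsmath (cases, \text) and the macros \loss, \episode, \cN, \rbr already used in this section.
\[
  \loss_\episode(s,a) \sim
  \begin{cases}
    \cN\rbr{0.7,\,0.25}, & a = \text{``left''},\\
    \cN\rbr{0.3,\,0.25}, & a = \text{``right''},
  \end{cases}
  \qquad \text{for every state } s,
\]
where the second argument of $\cN$ is the parameter $\sigma$ exactly as stated in the text, and the same distribution is used in every episode.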

\begin{figure}[!htbp]
  \centering
  \includegraphics[width=0.8\textwidth]{AISTATS2024/figs/AMDPRiverSwim.png}
  \caption{Modified RiverSwim MDP -- solid and dotted arrows denote the transitions under actions ``right'' and ``left'', respectively.}
  \label{fig: RiverSwim}
\end{figure}

\subsection{Results}
We evaluate Private-UC-O-REPS in the full-information setting and Private Bounded Bandit UC-O-REPS in the bandit feedback setting, using different privacy budgets $\pripara$ under both JDP and LDP constraints.
We compare them with the corresponding non-private algorithms, UC-O-REPS \citep{rosenberg2019onlineamdp} and Bounded Bandit UC-O-REPS \citep{rosenberg2019onlinessp}, respectively.
We set $\alpha=0.01$ for Private Bounded Bandit UC-O-REPS.
In addition, we set all parameters of our proposed algorithms to the same order as in the theoretical results, and tune the learning rate $\FTRLpara$ and the scaling of the confidence interval.
We run 5-20? independent experiments, each consisting of $\episodetotal = 10^4$ episodes.
We plot the average cumulative regret along with the standard deviation for each setting, as shown in Fig.~\ref{fig: Simulation}.

\begin{figure}[!htbp]
\centering
\begin{subfigure}{0.46\textwidth}
  \includegraphics[width=\linewidth]{AISTATS2024/figs/111.jpg}  
  \label{fig:Private-UC-O-REPS}
\end{subfigure}
\begin{subfigure}{0.46\textwidth}
  \includegraphics[width=\linewidth]{AISTATS2024/figs/222.jpg}  
  \label{fig: Private Bounded Bandit UC-O-REPS}
\end{subfigure}
\caption{Cumulative regret versus episode for Private-UC-O-REPS and Private Bounded Bandit UC-O-REPS.}
\label{fig: Simulation}
\end{figure}

Consistent with our theoretical analysis, in both the full-information and bandit feedback settings, the non-private algorithms perform best, and the cost of privacy under JDP becomes nearly negligible as the number of episodes increases.
Under LDP, however, the cost of privacy remains notably high, and the algorithm needs many more episodes to converge to a near-optimal policy.
Moreover, the cost of privacy grows as the protection level increases, i.e., as $\pripara$ decreases.
Finally, performance is worse in the bandit setting than in the full-information setting, owing to the more limited feedback.
These findings confirm that our simulation results are consistent with our theoretical analysis.