\section{Extended Related Work}
\noindent\textbf{Regret minimization for adversarial MDPs.}
Over the past decade, research on adversarial (tabular) Markov Decision Processes has covered various scenarios, including known and unknown transition functions, as well as full-information and bandit feedback settings.
When the transition function is known, \cite{zimin2013online} introduce the O-REPS algorithm, which runs Online Mirror Descent (OMD) over the space of occupancy measures.
This approach yields a regret bound of $\widetilde{O}(\horizontotal\sqrt{\episodetotal})$.
To tackle the challenges posed by unknown transition functions, \cite{rosenberg2019onlineamdp} combine confidence sets and Online Mirror Descent. 
This hybrid approach achieves the best-known regret bound of $\widetilde{O}(\sqrt{\statesize^2\actionsize\horizontotal^2\episodetotal})$ under the full-information setting. 
In the bandit setting, \cite{rosenberg2019onlinessp} develop inverse importance-weighted loss estimators and obtain a regret bound of $\widetilde{O}(\frac{\sqrt{\statesize^2\actionsize\horizontotal^2\episodetotal}}{\alpha})$ under the $\alpha$-reachability assumption.
Building on these works, \cite{jin20c} further improve the regret bound by introducing biased, optimistic loss estimators along with a tighter confidence set.
These advances lead to the best-known regret bound of $\widetilde{O}(\sqrt{\statesize^2\actionsize\horizontotal^2\episodetotal})$ under bandit feedback.
Additionally, policy-optimization-based methods have been developed by \cite{shani2020optimistic} and \cite{luo2021policy}. 
In particular, \cite{luo2021policy} match the best-known regret bound achieved by the OMD-based methods.
% and  \cite{dai2022follow} proposed Follow-the-Perturbed-Leader algorithms to address the high computational demands of occupancy-measure-based methods, and they achieved the same performance as occupancy-measure-based algorithms.

\vspace{1ex}
\noindent\textbf{Private online learning.}
Private online learning has been a subject of extensive research for over a decade, and \textit{follow-the-leader} type algorithms have been employed in various scenarios. 
For instance, \cite{guha2013nearly} introduce a private \emph{follow-the-approximate-leader} method for online convex learning.
Additionally, \cite{agarwal2017price} and \cite{kairouz2021practical} propose private \emph{follow-the-regularized-leader} algorithms for online linear optimization and online federated learning, respectively.
In the context of private bandit learning, \cite{tossou2017achieving} design a private variant of the EXP3 algorithm for adversarial bandits, while \cite{agarwal2017price} and \cite{zheng2020locally} explore private adversarial linear bandits and private convex bandit learning. 
These works offer general reduction frameworks that achieve nearly optimal regret.
Furthermore, a substantial body of research has focused on private online learning with contextual information, including private contextual bandits \cite{shariff2018differentially, zheng2020locally, chowdhury2022shuffle, charisopoulos23a} and private stochastic reinforcement learning \cite{vietri2020private, garcelon2021local, qiao2023near, liao2023locally}.
Notably, despite these advances in private online learning, no existing work addresses private reinforcement learning in adversarial Markov Decision Processes.
This open setting poses the distinct challenge of private online learning in an adversarial environment with contextual dynamics.

% Private online learning has been extensively studied for over a decade, including private online convex optimization (DP-OCO) \cite{jain2012differentially,guha2013nearly,jain2014near,kairouz2021practical,agarwal2023differentially},
% private online linear optimization and experts \cite{agarwal2017price,asi2023private},
% private stochastic bandits \cite{tossou2016algorithms,sajed2019optimal,basu2019differential,tenenbaum2021differentially,tao2022optimal},
% private adversarial bandits, \cite{tossou2017achieving,agarwal2017price,zheng2020locally},
% and contextual (linear) bandits \cite{shariff2018differentially,zheng2020locally,chowdhury2022shuffle,charisopoulos23a}.

% \vspace{1ex}
% \noindent\textbf{Private online reinforcement learning.}
% In the RL setting, to the best of our knowledge, existing works focusing on private reinforcement learning all studied stochastic MDPs.
% Under the tabular MDP, \cite{vietri2020private,garcelon2021local,qiao2023near} focus on value-iteration-based regret minimization algorithms to achieve privacy guarantees (JDP or LDP, or both).
% Besides, policy-optimization-based algorithms are also introduced in \cite{chowdhury2022differentially} to improve computational efficiency.
% \cite{wu2023differentially} studied a special case, i.e., episodic RL with heavy-tailed rewards, and obtained near-optimal regret guarantee.
% Under linear MDP or linear mixture MDP, value-iteration-based algorithms with high-dimensional statistics are also proposed in \cite{luyo2021differentially,ngo2022improved,zhou2022differentially}. 