\section{Introduction}
Reinforcement learning (RL) is a prominent sequential decision-making framework in which an agent learns to minimize its long-term loss by interacting with an environment.\footnote{We consider ``losses'' instead of ``rewards'' to be consistent with the adversarial online decision-making literature \citep{jain2012differentially,jin20c}. One can translate between losses and rewards by negation.}
It has gained remarkable traction in real-world applications across fields such as
% finance \cite{xu2022towards}, 
healthcare \citep{gottesman2019guidelines}, online recommendation \citep{afsar2022reinforcement}, and language models \citep{ouyang2022training}.
However, in these applications, the learning agent continuously improves its performance by learning from users' personal data and feedback, which usually contain sensitive information.
% Taking the recommendation system as an instance, the agent recommends items (corresponding to actions in MDPs) according to users' search input (corresponding to the states in MDPs), and improves its performance based on users' rating and feedback (corresponding to the losses in MDPs).
Without privacy protection mechanisms in place, the learning agent can memorize information about users' interaction histories \citep{carlini2019secret}, which makes it vulnerable to various privacy attacks \citep{lei2023new}.

Over the past decade, \textit{differential privacy} (DP) \citep{dwork2006calibrating} has been extensively applied in various private decision-making settings, e.g., private multi-armed bandits \citep{basu2019differential,tao2022optimal}.
Under DP, the learning agent collects users' raw data to train algorithms while ensuring that its output is nearly indistinguishable from the output it would produce if the data of any single user were replaced, thereby mitigating the aforementioned privacy risk.
Despite this promise,
\citet{shariff2018differentially} and \citet{vietri2020private} show that standard DP is incompatible with sub-linear regret for contextual bandits and RL.
Therefore, they adopt \textit{joint differential privacy} (JDP) \citep{kearns2014mechanism}, a variant of DP which ensures that the outputs delivered to all other users do not leak much information about any specific user.
In some situations, \textit{local differential privacy} (LDP) \citep{duchi2013local} is adopted in private RL \citep{garcelon2021local} for its stronger privacy guarantee: each user's raw data must be privatized before being sent to the learning agent.
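
For concreteness, the standard notion of $\epsilon$-DP can be sketched as follows; the symbols $\mathcal{M}$, $U$, $U'$, and $E$ are illustrative notation introduced only for this sketch and need not coincide with the formal definitions used later in the paper. A randomized mechanism $\mathcal{M}$ is $\epsilon$-DP if, for any two user sequences $U$ and $U'$ that differ in a single user and any measurable set of outputs $E$,
\[
\Pr\left[\mathcal{M}(U)\in E\right] \leq e^{\epsilon}\, \Pr\left[\mathcal{M}(U')\in E\right].
\]
Roughly speaking, JDP imposes this requirement only on the outputs delivered to all users other than the one being replaced, whereas LDP imposes it on each user's privatized data before it reaches the learning agent.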

% Over the past decade, \textit{differential privacy} (DP) \citep{dwork2006calibrating} has become a standard tool in designing such private decision-making algorithms, which has been extensively utilized in multi-armed bandits \cite{basu2019differential,tao2022optimal}.
% Under DP, the learning agent collects users' raw data to train its algorithm while ensuring that the output is indistinguishable from its output returned by an alternative universe where any individual user is replaced, thereby mitigating the aforementioned privacy risk.
% However, recent works \cite{shariff2018differentially} and \cite{vietri2020private} show that standard DP is incompatible with sub-linear regret bound for contextual bandits and RL. 
% Therefore, they embrace \textit{joint differential privacy} (JDP) \cite{kearns2014mechanism}, a variant of DP, which ensures that the output of all other users will not leak much information about any specific user.
% In some situations, users are even unwilling to share their raw data with the learning agent.
% So \textit{local differential privacy} (LDP) \citep{duchi2013local} has been adopted in private RL \cite{garcelon2021local} due to its stronger privacy guarantee, where each user's raw data must be privatized before being sent to the learning agent.

% Over the past decade, \textit{differential privacy} (DP) \cite{dwork2006calibrating} has become a standard tool in designing private sequential decision-making algorithms, which has been extensively studied in various settings, including online convex or linear optimization \cite{jain2012differentially,agarwal2017price}, multi-armed bandit under stochastic or adversarial environments \cite{tossou2017achieving,tao2022optimal}, etc.
% Under DP, a learning agent collects users' raw data to train its algorithm while ensuring that the output will not reveal users' sensitive information.
% However, recent works \cite{shariff2018differentially} show that standard DP is incompatible with sub-linear regret bound for contextual bandits.
% Therefore, a relaxed variant of DP: \textit{joint differential privacy} (JDP) \cite{kearns2014mechanism} is proposed, and such notion has been studied extensively in bandits problems \cite{shariff2018differentially,garcelon2022privacy}.
% In addition, in some situations, users are unwilling to share their raw data with the learning agent.
% So another variant of DP, \textit{local differential privacy} (LDP) \cite{duchi2013local} has gained increasing attention due to its stronger privacy guarantee, where each user's raw data is privatized before being sent to the learning agent.
% LDP has also been studied in various bandit settings recently \cite{zheng2020locally,han2021generalized}.

% In addition to the vast amount of work in private bandit algorithms, differential privacy has also been used in the RL problem under stochastic environment recently. \cite{vietri2020private} first defined JDP and proposed PUCB with regret bound and JDP guarantee.
% \cite{garcelon2021local} introduced the LDP notion and designed LDP-OBI with regret bound and LDP guarantee.
% \cite{chowdhury2022differentially} provided general frameworks for both policy optimization and value iteration methods for this problem, and \cite{qiao2023near} improved the regret bound with a tighter confidence bound.

% However, private RL is still far from well-understood. 
% All of the previous work assumes that the losses are generated by a stochastic distribution that is stationary throughout the learning process. 
% But this assumption is quite restrictive for plenty of real-world systems such as search advertisement \cite{rappaport2007lessons}, medicine \cite{gottesman2019guidelines}, and portfolio management \cite{luo2018efficient}, where the learning agent interacts with dynamic and even adversarial environments.
% That is, the loss function may change over time and even be chosen by a potential adversary. 
% We still take the mentioned recommendation system as an example, the learning agent's goal is to minimize the total bad rating from users over time, while in reality, the same recommendation according to the same search input for different users may result in totally different feedback. 

% Motivated by these facts, in this paper, we focus on one fundamental model in online RL under DP constraints, i.e., adversarial Markov decision processes (AMDPs) \cite{even2009online}.
% Specifically, we consider a general setting where the interaction proceeds in episodes with a fixed horizon.
% Within each episode, the agent sequentially observes the users' current state, selects an action, suffers the loss corresponding to the chosen state-action pair, and transits to the next state according to the unknown transition function.
% At the end of this episode, the agent observes feedback, -- the loss for every state-action pair in \textit{full-information} setting, or the loss for each visited state-action pair in \textit{bandit} setting.
% Between episodes, The loss function can change arbitrarily.
% The goal is to minimize his regret: the difference between the total suffered loss and the total loss of an optimal fixed policy.

Nonetheless, private RL is still far from being well understood.
All previous work assumes that the losses are drawn from a distribution that remains stationary throughout the learning process.
This assumption is quite restrictive for many real-world systems, since the loss function may depend on additional variables controlled by a complex and unpredictable part of the environment.
These extra variables affect only the loss incurred by the user, and they may be difficult to model and predict with a stochastic distribution.
Specifically, the loss function might vary unpredictably across episodes and may even be generated by an adversary.
In such scenarios (with privacy concerns), modeling the loss functions as adversarial is more appropriate; examples include recommendation systems \citep{zhou2019privacy}, medical trials \citep{liu2020blockchain}, and portfolio management \citep{luo2018efficient}.
%More precisely, 
For instance, in recommendation systems, the agent recommends items (corresponding to actions) according to users' search input (corresponding to states) and improves its performance based on users' ratings (corresponding to losses, up to negation). These ratings may depend on complex and hard-to-model historical variables of each user and reflect different preferences.

Motivated by these facts, we focus on a fundamental model in online RL under DP constraints, namely private adversarial MDPs \citep{even2009online}, in which the transition function is unknown and stochastic while the loss function can be chosen arbitrarily by an oblivious adversary.
% change arbitrarily over time.
Solving this problem requires private algorithms that operate in such non-stationary environments and, in particular, handle adversarial loss functions.
Moreover, we must deal with the dual complexities of adversarially changing and noisy interaction histories, which make it challenging to utilize and generalize past experience and to adapt to evolving circumstances.
To the best of our knowledge, this paper is the first to consider adversarial MDPs with both JDP and LDP guarantees. 
Our contributions are summarized as follows.
% To solve this problem, we confront several challenges. 
% First, the learning agent is required to protect the privacy of AMDPs, especially when facing the dynamically changing loss and the stochastic transition function.
% Furthermore, the agent must grapple with the dual complexities of adversarially chosen and noisy interaction histories, which makes it challenging to utilize and generalize past experiences and adapt to evolving circumstances. To the best of our knowledge, we are the first to consider AMDPs with both JDP and LDP guarantees.

$1.$ We begin with the full-information setting, where the loss for \emph{every} state-action pair is observed after each interaction.
We present a general algorithm, ``Private-UC-O-REPS'', which uses tighter confidence bounds on the components of the transition function than existing ones in the adversarial MDP literature and enjoys refined regret bounds under JDP and LDP constraints by adopting our Central and Local Privatizers, respectively. Notably, these bounds are problem-dependent: they involve a notion of effective support of the underlying transition function and thus adapt to the difficulty of the transition dynamics.
Furthermore, in the worst case they match the best known bounds of non-private algorithms \citep{rosenberg2019onlineamdp}.


$2.$ We then consider the bandit setting, where only the loss of each \emph{visited} state-action pair is revealed after each interaction.
We propose the ``Private UOB-LBPS'' algorithm, which involves a novel private and optimistic loss estimator together with a log-barrier regularizer for private online mirror descent (OMD) that makes the algorithm more stable.
We obtain near-optimal problem-dependent regret bounds under both JDP and LDP constraints.
In the worst case, these bounds also match the near-optimal regret bounds of the best non-private algorithm \citep{jin20c}.
    
$3.$ We introduce novel Privatizers designed to privatize both the transition function and the adversarial losses under full-information and bandit-feedback settings.
These Privatizers satisfy several key properties (see Assumptions \ref{assp: private counts}, \ref{assp: Private loss in full-information setting} and \ref{assp: Private loss in bandit feedback setting} for details), which play a critical role in establishing both the privacy guarantees and the regret bounds, and could be of interest beyond this work.

% \begin{itemize}
%     \item \textbf{Full-information Setting}: 
%     We present a general framework, "Private-UC-O-REPS",
%     % , for designing private occupancy-measure-based policy search algorithms.
%     which enables us to obtain near-optimal regret bounds under JDP and LDP requirements by instantiating it with the Central-Privatizer and the Local-Privatizer, respectively. 
%     Besides, the regret bounds also match the best of known non-private algorithm \cite{rosenberg2019onlineamdp}.
%     \item \textbf{Bandit Setting}: We present the "Private-Bounded-Bandit-UC-O-REPS" framework, assuming that any state is reachable under any policy with probability $\alpha>0$. 
%     This framework also allows us to obtain near-optimal regret bounds under both JDP and LDP constraints using a unified analysis technique.
%     \item \textbf{Privatizers Design}: We propose novel Privatizers to privatize the stochastic transitions and adversarial losses.
%     They satisfy several nice properties (see Assumption \ref{assp: private counts} for detail) which play a pivotal role in obtaining our regret bounds.
% \end{itemize}

% \begin{itemize}
%     \item We start with the full-information setting where the adversarial loss functions are observed after each interaction. 
%     We present a general framework, "Private-UC-O-REPS", for designing private occupancy-measure-based policy search algorithms.
%     This framework enables us to obtain near-optimal regret bounds under JDP and LDP requirements by instantiating it with the Central-Privatizer and the Local-Privatizer, respectively, 
%     and the regret bounds also match the best-known non-private algorithm under full-information setting \cite{rosenberg2019onlineamdp}.
%     \item For bandit information, we present the "Private-Bounded-Bandit-UC-O-REPS" framework, assuming that any state is reachable under any policy with probability $\alpha>0$. 
%     This framework also allows us to obtain near-optimal regret bounds under both JDP and LDP constraints using a unified analysis technique.
%     \item We propose novel techniques and post-processing methods to design Privatizers that privatize the visitation numbers and adversarial losses under both full-information and bandit settings separately, {\color{red}satisfying Assumption \ref{assp: private counts}, which helps obtain our regret bounds. }
% \end{itemize}
We summarize our theoretical results in Table \ref{table: contributions}. 
Due to space limitations, the algorithms and all proof details are deferred to the appendix.

\begin{table*}
\renewcommand{\arraystretch}{2}
\small
\centering
\begin{tabular}{|c|c|c|c|c|}
\hline
Feedback & Result & Regret ($\epsilon$-JDP) & Regret ($\epsilon$-LDP) & Lower bound (non-private) \\ \hline
Full-info & Theorem \ref{thm: Regret bound of Private UC-O-REPS} 
& $\widetilde{\cO}\rbr{\horizontotal\cumlocsupport \sqrt{\episodetotal} + \frac{\statesize^2\actionsize\horizontotal^2}{\pripara}}$ 
& $\widetilde{\cO}\rbr{\horizontotal\cumlocsupport \sqrt{\episodetotal} + \frac{\statesize^2\actionsize\horizontotal^2\sqrt{\episodetotal}}{\pripara}}$ 
& $\Omega(\sqrt{X A H K})$\\ \hline
Bandit & Theorem \ref{thm: Regret bound of Private UOB-LBPS} 
& $\widetilde{\cO}\rbr{\horizontotal\cumlocsupport \sqrt{\episodetotal} + \frac{\actionsize\horizontotal\sqrt{\statesize^3\episodetotal}}{\pripara}}$
& $\widetilde{\cO}\rbr{\horizontotal\cumlocsupport \sqrt{\episodetotal} + \frac{\statesize^4\actionsize\horizontotal^2\sqrt{\episodetotal}}{\pripara}}$ 
& $\Omega\left(\sqrt{X A H^2 K}\right)$  \\ \hline
% Full-info & Theorem \ref{thm: Regret bound of Private UC-O-REPS} 
% & $\cO\rbr{\horizontotal\statesize\sqrt{\actionsize\episodetotal} + \frac{\statesize^2\actionsize\horizontotal^2}{\pripara}}$ 
% & $\cO\rbr{\horizontotal\statesize\sqrt{\actionsize\episodetotal} + \frac{\statesize^2\actionsize\horizontotal^2\sqrt{\episodetotal}}{\pripara}}$ 
% & $\Omega(\sqrt{X A H K})$\\ \hline
% Bandit & Theorem \ref{thm: Regret bound of Private UOB-LBPS} 
% & $\cO\rbr{\horizontotal\statesize\sqrt{\actionsize\episodetotal} + \frac{\horizontotal^2\sqrt{\episodetotal}}{\pripara}}$
% & $\cO\rbr{\horizontotal\statesize\sqrt{\actionsize\episodetotal} + \frac{\statesize^4\actionsize\horizontotal^2\sqrt{\episodetotal}}{\pripara}}$ 
% & $\Omega\left(\sqrt{X A H^2 K}\right)$  \\ \hline
\end{tabular}
\caption{
Regret comparisons for private online RL on loop-free adversarial MDPs\protect\footnotemark~under both full-information and bandit settings with $\pripara$-JDP and $\pripara$-LDP guarantees. 
% Here $\statesize,\actionsize,\episodetotal,\horizontotal$ refer to the size of state space, the size of action space, the total number of episodes, and the number of steps per episode, respectively.
% $\pripara > 0$ is the desired privacy level. 
$\cumlocsupport:= \sum_{\horizon=0}^{\horizontotal-1} \sqrt{\sum_{(\state,\action)\in\statespace_\horizon\times\actionspace} \locsupport_{\state,\action}}$ denotes the cumulative effective support, where
$\locsupport_{\state,\action}:=[\sum_{\state'\in\statespace_{\horizon(\state)+1}} \sqrt{\transeasy(\state'\vert\state,\action)(1-\transeasy(\state'\vert\state,\action))}]^2$ denotes the effective support of $\transeasy(\cdot\vert\state,\action)$. 
The lower bounds follow from \citet{jaksch2010near,jin2018q}.
Note that $\cumlocsupport\leq\statesize\sqrt{\actionsize}$ always holds, implying that the leading terms of our bounds are never worse than the $\widetilde{\cO}(\horizontotal\statesize\sqrt{\actionsize\episodetotal})$ bound of the non-private setting \citep{rosenberg2019onlineamdp,jin20c}.
}
\label{table: contributions}
\end{table*}
\footnotetext{In the loop-free tabular MDP considered in this paper, the result for episodic MDPs \citep{jin2018q} incurs an additional dependence on $\horizontotal$.}
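
The inequality $\cumlocsupport\leq\statesize\sqrt{\actionsize}$ stated in the caption of Table \ref{table: contributions} can be checked with a short calculation. The following is only a sketch, under the assumption that the layers $\statespace_0,\dots,\statespace_{\horizontotal}$ partition the state space, so that $\sum_{\horizon=0}^{\horizontotal}|\statespace_\horizon| = \statesize$, as is standard in the loop-free setting. By the Cauchy--Schwarz inequality, for every $(\state,\action)\in\statespace_\horizon\times\actionspace$,
\[
\locsupport_{\state,\action}
\leq |\statespace_{\horizon+1}| \sum_{\state'\in\statespace_{\horizon+1}} \transeasy(\state'\vert\state,\action)\left(1-\transeasy(\state'\vert\state,\action)\right)
\leq |\statespace_{\horizon+1}|,
\]
where the last step uses $\sum_{\state'\in\statespace_{\horizon+1}} \transeasy(\state'\vert\state,\action) = 1$. Summing over the $|\statespace_\horizon|\,\actionsize$ pairs in layer $\horizon$ and applying $\sqrt{ab}\leq (a+b)/2$ then gives
\[
\cumlocsupport
\leq \sum_{\horizon=0}^{\horizontotal-1} \sqrt{\actionsize\,|\statespace_\horizon|\,|\statespace_{\horizon+1}|}
\leq \sqrt{\actionsize} \sum_{\horizon=0}^{\horizontotal-1} \frac{|\statespace_\horizon|+|\statespace_{\horizon+1}|}{2}
\leq \statesize\sqrt{\actionsize}.
\]
The bound is attained only up to constants when every transition distribution is close to uniform over the next layer; when transitions are nearly deterministic, each $\locsupport_{\state,\action}$ is close to zero and $\cumlocsupport$ is much smaller.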

\subsection{Related Work}
Private online decision-making in adversarial environments has been studied for over a decade, with \textit{follow-the-leader} type algorithms commonly employed to address these challenges.
Examples of such scenarios include private online convex learning \citep{jain2012differentially,agarwal2023differentially}, private expert prediction \citep{agarwal2017price,asi2023private},
and private (contextual) adversarial bandits \citep{tossou2017achieving,agarwal2017price,zheng2020locally}.
% Besides, \cite{basu2019differential} provides a comprehensive account of differential privacy definitions used in the bandit literature.

Regarding private RL with regret guarantees, previous research has primarily focused on MDPs in stochastic stationary environments.
Notable approaches include private value-based algorithms \citep{vietri2020private,garcelon2021local,qiao2023near,qiao2023offline} and private policy-optimization-based algorithms \citep{chowdhury2022differentially,wu2023differentially}, particularly in tabular MDPs.
Initial investigations into private linear (mixture) MDPs were also undertaken by \citet{luyo2021differentially,ngo2022improved,zhou2022differentially,liao2023locally}.
However, the machinery and techniques used in these papers cannot be directly applied in an adversarial environment.
% Private value-iteration-based algorithms \citep{vietri2020private,garcelon2021local,qiao2023near,qiao2023offline} and private policy-optimization-based algorithms \citep{chowdhury2022differentially,wu2023differentially} are both proposed to settle this issue in tabular MDP.
% Some initial works on private linear (mixture) MDP also include \cite{luyo2021differentially,ngo2022improved,zhou2022differentially}.

% Regarding private RL, existing papers focusing on private reinforcement learning all studied the MDP under stationary environments.
% For tabular MDP, \cite{vietri2020private,garcelon2021local,qiao2023near,qiao2023offline} focus on value-iteration-based regret minimization algorithms under privacy constraints, and policy-optimization-based algorithms are also introduced in \cite{chowdhury2022differentially} to improve computational efficiency.
% More recently, \cite{wu2023differentially} studied a special case, i.e., private RL with heavy-tailed rewards.
% For linear (mixture) MDP, value-iteration-based algorithms with function approximation are also proposed in \cite{luyo2021differentially,ngo2022improved,zhou2022differentially}. 

% Adversarial MDP has been extensively studied to deal with the non-stationary environment in both known and unknown transition functions, full-information and bandit feedback settings.
% While a number of algorithms with regret guarantees have been proposed for this problem recently \citep{rosenberg2019onlineamdp,rosenberg2019onlinessp,jin20c,luo2021policy,dai2022follow,zhao2023learning}, we are not aware of any existing work on private adversarial MDP.
% Thus, our work takes the first step towards a unified framework for private AMDP with general privacy and regret guarantees.
Adversarial MDPs have received extensive attention as a model of non-stationary environments, under both known and unknown transition functions and under both full-information and bandit feedback.
While a number of algorithms with regret guarantees have been proposed for this problem recently \citep{rosenberg2019onlineamdp,rosenberg2019onlinessp,jin20c,luo2021policy,zhao2023learning}, we are not aware of any existing work on private adversarial MDPs.
%Thus, we {\color{purple}believe this paper} takes the initial steps towards unified frameworks for adversarial MDPs with privacy and regret guarantees.
Thus, we believe this paper makes a first attempt at designing algorithms for adversarial MDPs that simultaneously enjoy privacy and regret guarantees.

% \cite{zimin2013online} first assume known transition and propose the O-REPS algorithm which applies Online Mirror Descent over the space of occupancy measures, which we also applied in our paper.
% Similar ideas are also applied under unknown transition and full information setting \cite{rosenberg2019onlineamdp}, which achieved the best-known regret bound $\widetilde{O}\rbr{\sqrt{\statesize^2\actionsize\horizontotal^2\episodetotal}}$.
% Under bandit setting, \cite{rosenberg2019onlinessp} extended the idea and obtained regret bound $\widetilde{O}\rbr{\frac{\sqrt{\statesize^2\actionsize\horizontotal^2\episodetotal}}{\alpha}}$ assuming that all states are reachable with probability $\alpha>0$ under any policy. 
% Later, \cite{jin20c} achieved best-known regret bound $\widetilde{O}\rbr{\sqrt{\statesize^2\actionsize\horizontotal^2\episodetotal}}$ with biased loss estimate and tighter confidence set.
% At the same time, \cite{shani2020optimistic,luo2021policy} proposed policy optimization-based methods, \cite{dai2022follow} proposed Follow-the-Perturbed-Leader algorithms to solve the expensive computation problem in occupancy-measure-based methods, and achieved the same performance as occupancy-measure-based algorithms.