\section{Full-Information Setting} \label{sec: full-information setting}
Similar to the non-private algorithms \citep{rosenberg2019onlineamdp}, we propose a general framework, \textit{Private Upper Confidence Online Relative Entropy Policy Search} (Private-UC-O-REPS). 
The core idea is to solve an online convex optimization problem within the occupancy measure space, which combines two key elements: tighter confidence bounds on components of the transition estimate, and i.i.d. perturbations with bounded maxima on private cumulative loss functions.
Full details of the algorithm are deferred to the appendix; we give a brief description below.

In episode $\episode$, we first utilize the private counters $\visitxatotalprieasy_{\episode+1}$ to establish a confidence set $\transspace_{\episode+1}$ that contains the true transition function with high probability, whose radius shrinks as more data is collected. 
Then, the occupancy measure $\occmeasure_{\episode+1}$ is updated by solving an online optimization problem within %the established confidence set 
$\transspace_{\episode+1}$ using the Follow-the-Regularized-Leader (FTRL) method \citep{hazan2016introduction}. 
FTRL is a natural choice here because the update is driven by the private cumulative loss $\losscumprieasy_{\episode+1}$.
Finally, the induced policy $\policy^{\occmeasure_{\episode+1}}$ is chosen and executed in the next episode. 

Specifically, the confidence set $\transspace_{\episode}$ is defined as
% \footnote{In definition of the confidence set (Eq.~\eqref{eq: confidence set of transition}), there is an implicit constraint on $\transeasy\rbr{\cdot\vert\state,\action}$ being a valid distribution over states in $\statespace_{\horizon\rbr{\state}+1}$ for each $\rbr{\state,\action}$ pair, which following the loop-free property.} 
% \begin{equation}\label{eq: confidence set of transition} % L1-norm version
% \begin{aligned}
%     \transspace_{\episode+1} \! = \! \Big\{ \transeasy :& \nbr{\transeasy \rbr{\cdot \vert \state, \action} \! - \! \transprieasy_{\episode+1} \rbr{\cdot \vert \state, \action} }_1 \! \\
%     &\leq \! \beta_{\episode+1}\rbr{\state,\action},  
%     \forall \rbr{\state,\action}\in\statespace \times\actionspace \Big\},
% \end{aligned}
% \end{equation}
% with $\confnormtrans{\rbr{\state,\action}} \!:=\! \frac{\confconstant}{\sqrt{\max\cbr{1,\visitxatotalpri}}} + \frac{2\statesize_{\horizon+1}\confcountxax}{\max\cbr{1,\visitxatotalpri}}$, where $\confconstant = \sqrt{2\rbr{\statesize_{\horizon+1}\ln2 + \ln\frac{\statesize\actionsize\episodetotal}{\delta}}}$ and $\horizon=\horizon\rbr{\state}$.
% $\transspace_\episode$ is constructed using time-uniform Weissman's inequality and satisfies:
\begin{equation}\label{eq: confidence set of transition} % point-wise confidence
\begin{aligned}
    \transspace_{\episode} \! = &  \Big\{\! \transeasy\!\in\!\triangle_{\cX\times\cA}^\cX : \big|(\transeasy \! - \! \transprieasy_{\episode}) \rbr{\state' \vert \state, \action} \big|  \leq  \confpwtrans_{\episode}\!\rbr{\state'\vert\state,\action}, \\
    &  
    \forall \rbr{\state,\action,\state'}\in\statespace_\horizon \times\actionspace \times\statespace_{\horizon+1},\horizon\in\sbr{\horizontotal} \Big\},
\end{aligned}
\end{equation}
% \sadegh{In the def of $\cP_k$, we must add something to indicate the $P$ is a distribution. Try to define the simplex of distributions over $\cX$, for example. \sj{We have defined this distribution of $P$ in the first sentence of paragraph 2.1 -- adversarial MDPs.}}
where $\triangle_{\cX\times\cA}^\cX$ denotes the set of all transition functions for the state-action space $\cX\times\cA$, and where
the confidence width associated with $(\state,\action,\state')$ is defined as $\confpwtrans_{\episode}\!\rbr{\state'\vert\state,\action} \! = \! \min\Bigg\{\! 1, \sqrt{\frac{2\transprieasy_\episode\rbr{\state'\vert\state,\action}\rbr{1-\transprieasy_\episode\rbr{\state'\vert\state,\action}} \ln\iota}{\visitxatotalpri}} + \! \frac{4\confcountxax+7\ln\iota}{\visitxatotalpri} \Bigg\}$, with $\iota\! =\! \frac{\statesize\actionsize\episodetotal}{\delta}$ for a parameter $\delta\!\in\!(0,1)$.
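
For intuition, the following back-of-the-envelope computation illustrates how the width adapts to the variance of the estimate. All numbers are hypothetical and chosen only for illustration: we take a private visit count of $10^4$ for the pair $\rbr{\state,\action}$, set $\ln\iota = 10$, and ignore the privacy term by setting $\confcountxax = 0$. Writing $\bar p = \transprieasy_\episode\rbr{\state'\vert\state,\action}$ as shorthand used only here,
\begin{equation}\notag
\begin{aligned}
    \bar p = 0.5:\quad & \sqrt{\frac{2\cdot 0.5\cdot 0.5\cdot 10}{10^4}} + \frac{7\cdot 10}{10^4} \approx 0.0224 + 0.0070 = 0.0294, \\
    \bar p = 0.01:\quad & \sqrt{\frac{2\cdot 0.01\cdot 0.99\cdot 10}{10^4}} + \frac{7\cdot 10}{10^4} \approx 0.0044 + 0.0070 = 0.0114.
\end{aligned}
\end{equation}
In this example, the width of a nearly deterministic entry is dominated by the lower-order term, which scales inversely with the visit count.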
We have the following lemma, thanks to Bernstein-type concentration: 
%By Bernstein-type inequality and union bounds, we have the following:
\begin{lemma}
\label{lemma: concentration of private transition error}
Let $K>0$. Then, with high probability, $\transeasy \in \transspace_\episode$ uniformly over all $\episode \in [K]$.
\end{lemma}
Moreover, one can show that the confidence bound above is strictly tighter than those used in \cite{rosenberg2019onlineamdp,rosenberg2019onlinessp,jin20c}, which proves instrumental in obtaining problem-dependent 
%is important for getting our problem-dependent 
regret bounds. 
% Following the argument in \cite{neu2012adversarial}, it can be shown that $\occmeasureset\rbr{\transspace_\episode}$ also contains $\occmeasureset\rbr{\transeasy}$ with high probability, as a result of Lemma \ref{lemma: concentration of private transition error}.

Unlike non-private algorithms for adversarial MDPs, we use the FTRL method to choose the occupancy measure $\occmeasure_\episode$. FTRL is a standard technique for online optimization that strikes a balance between exploiting past knowledge and exploring new options.
Formally, given a parameter $\FTRLpara>0$,
\begin{equation}
\label{eq: update occupancy measure full info}
    \occmeasure_{\episode+1} = \argmin_{\occmeasure \in \occmeasureset\rbr{\transspace_{\episode+1}}} \inner{\losscumprieasy_{\episode+1}}{\occmeasure} + \frac{1}{\FTRLpara}\regularizer\rbr{\occmeasure},
\end{equation}
% \XD{In what sense is this entropy function ``generalized''? (3.3) is a standarad shannon entropy function, no? Is a minus sign missing?}
where we use the negative entropy regularizer,
\begin{equation}
\label{eq: regularizer}
    % \regularizer\rbr{\occmeasure} = \sum_{\rbr{\state,\action}\in \statespace\times\actionspace} \occmeasure\rbr{\state,\action} \ln \occmeasure\rbr{\state,\action}.
    \regularizer\!\rbr{\occmeasure} \!= \!\! \sum_{\horizon=0}^{\horizontotal-1} \! \sum_{\state\in\statespace_{\horizon}} \! \sum_{\action\in\actionspace}\sum_{\state'\in\statespace_{\horizon+1}} \!\!\! \occmeasure\rbr{\state,\!\action,\!\state^\prime} \ln \occmeasure\rbr{\state,\!\action,\!\state^\prime}.
\end{equation}
Note that the update can be implemented efficiently in two steps: one first solves an unconstrained optimization problem, which has a closed-form solution, and then solves a convex projection problem, which can be done in polynomial time (see Appendix \ref{subsec: Updating Occupancy Measure - full-info} for details).
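
As a sketch of these two steps (the formulation in the appendix remains the authoritative one), assume that the inner product in Eq.~\eqref{eq: update occupancy measure full info} reads $\inner{\losscumprieasy_{\episode+1}}{\occmeasure} = \sum_{\state,\action,\state'} \losscumprieasy_{\episode+1}\rbr{\state,\action}\occmeasure\rbr{\state,\action,\state'}$, and let $\widetilde{\occmeasure}_{\episode+1}$ denote the unconstrained minimizer, a notation introduced only for this sketch. Setting the gradient of the unconstrained objective to zero, i.e., $\losscumprieasy_{\episode+1}\rbr{\state,\action} + \frac{1}{\FTRLpara}\rbr{\ln \occmeasure\rbr{\state,\action,\state'} + 1} = 0$, gives
\begin{equation}\notag
    \widetilde{\occmeasure}_{\episode+1}\rbr{\state,\action,\state'} = \exp\rbr{-\FTRLpara\,\losscumprieasy_{\episode+1}\rbr{\state,\action} - 1}.
\end{equation}
Since $\nabla\regularizer\rbr{\widetilde{\occmeasure}_{\episode+1}} = -\FTRLpara\,\losscumprieasy_{\episode+1}$, the objective in Eq.~\eqref{eq: update occupancy measure full info} equals, up to a positive factor and an additive constant, the Bregman divergence of $\regularizer$ from $\widetilde{\occmeasure}_{\episode+1}$. The constrained update is therefore the projection
\begin{equation}\notag
    \occmeasure_{\episode+1} = \argmin_{\occmeasure \in \occmeasureset\rbr{\transspace_{\episode+1}}} \sum_{\state,\action,\state'} \rbr{ \occmeasure \ln\frac{\occmeasure}{\widetilde{\occmeasure}_{\episode+1}} - \occmeasure + \widetilde{\occmeasure}_{\episode+1} }\rbr{\state,\action,\state'},
\end{equation}
that is, the projection of $\widetilde{\occmeasure}_{\episode+1}$ onto $\occmeasureset\rbr{\transspace_{\episode+1}}$ with respect to the unnormalized KL divergence.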

To privatize the loss function and achieve an optimal regret guarantee in private online learning,
% we introduce an assumption on the Privatizer of the loss function.
we introduce a general assumption on the Privatizer of the loss function, which will be satisfied by our design in Section \ref{sec: privacy and regret guarantees}.

\begin{assumption}[Private loss in full-information setting] \label{assp: Private loss in full-information setting}
For any privacy budget $\pripara >0$ and all $(\state,\action,\episode)$, $\noise\!:=\!\losscumpri \!\!- \!\losscum$ are i.i.d.~random variables, satisfying
$\expect\sbr{\max_{\state,\action} \noise - \min_{\state,\action} \noise} \!\leq\! \conflossf$, for some $\conflossf > 0$. 
\end{assumption}

Assumption \ref{assp: Private loss in full-information setting} guarantees i.i.d.~perturbations with bounded maxima on the cumulative loss, which allows us to convert the effect of the perturbed loss into an additive, bounded bias term in the regret bound.
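
To illustrate how Assumption \ref{assp: Private loss in full-information setting} can be met, consider the following hypothetical example, which is not necessarily the Privatizer used in Section \ref{sec: privacy and regret guarantees}. Suppose that, for a fixed episode, the perturbations are i.i.d.~Laplace random variables $Z_1,\dots,Z_n$ with scale $b>0$, where $n = \statesize\actionsize$ is the number of state-action pairs; the symbols $Z_i$, $b$ and $n$ are introduced only for this example. Then $|Z_1|,\dots,|Z_n|$ are i.i.d.~exponential random variables with mean $b$, and hence
\begin{equation}\notag
    \expect\sbr{\max_{i} Z_i - \min_{i} Z_i} \leq 2\,\expect\sbr{\max_{i} |Z_i|} = 2b\sum_{j=1}^{n}\frac{1}{j} \leq 2b\rbr{1+\ln n}.
\end{equation}
In this example, the assumption therefore holds with $\conflossf = 2b\rbr{1+\ln\rbr{\statesize\actionsize}}$, which grows only logarithmically with the number of state-action pairs.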

To provide a problem-dependent regret, we %introduce 
recall from \citep{bourel2020tightening} the notion of \emph{effective support}, which for a pair $(\state,\action)$ is defined as 
$\locsupport_{\state,\action}:=\rbr{\sum_{\state'\in\statespace_{\horizon(\state)+1}} \sqrt{\transeasy\rbr{\state'\vert\state,\action}\rbr{1-\transeasy\rbr{\state'\vert\state,\action}}}}^2.$ 
Further, the \emph{cumulative effective support} is defined as $\cumlocsupport:= \sum_{\horizon=0}^{\horizontotal-1} \sqrt{\sum_{(\state,\action)\in\statespace_\horizon\times\actionspace} \locsupport_{\state,\action}}$.
Both %parameters 
notions characterize the local structure and difficulty of the MDP, and they are never larger than their worst-case counterparts.
As \cite{bourel2020tightening} show, $\locsupport_{\state,\action}$ is controlled by the number $G_{\state,\action}$ of successor states of $\rbr{\state,\action}$\footnote{For a pair $\rbr{\state,\action}$, we define $G_{\state,\action}:=|\text{supp}\rbr{\transeasy\rbr{\cdot\vert\state,\action}}|$.}, 
and one has: %obvious observations are as follows,
$\locsupport_{\state,\action}\leq G_{\state,\action} - 1\leq  \statesize_{\horizon(\state)+1} - 1$ and 
$\cumlocsupport \leq \statesize\sqrt{\actionsize}$.
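
Both bounds can be checked directly. The following short derivation is a sketch that uses the shorthand $p_{\state'} = \transeasy\rbr{\state'\vert\state,\action}$ and $G = G_{\state,\action}$ (introduced only here), and assumes that the layers partition the state space, so that $\sum_{\horizon=0}^{\horizontotal}\statesize_\horizon = \statesize$. By the Cauchy-Schwarz inequality applied over the $G$ successor states,
\begin{equation}\notag
    \locsupport_{\state,\action} \leq G \sum_{\state'} p_{\state'}\rbr{1-p_{\state'}} = G\rbr{1 - \sum_{\state'} p_{\state'}^2} \leq G\rbr{1-\frac{1}{G}} = G - 1,
\end{equation}
where the last inequality uses $\sum_{\state'} p_{\state'}^2 \geq 1/G$, which again follows from Cauchy-Schwarz since $\sum_{\state'} p_{\state'} = 1$. Plugging $\locsupport_{\state,\action}\leq\statesize_{\horizon+1}$ into the definition of $\cumlocsupport$ and applying the AM-GM inequality,
\begin{equation}\notag
    \cumlocsupport \leq \sum_{\horizon=0}^{\horizontotal-1}\sqrt{\statesize_\horizon\actionsize\statesize_{\horizon+1}} \leq \sqrt{\actionsize}\sum_{\horizon=0}^{\horizontotal-1}\frac{\statesize_\horizon+\statesize_{\horizon+1}}{2} \leq \statesize\sqrt{\actionsize}.
\end{equation}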


The following theorem presents a general regret bound for Private-UC-O-REPS when instantiated with any Privatizer that satisfies Assumption \ref{assp: private counts} and Assumption 
\ref{assp: Private loss in full-information setting}.
\begin{theorem}
\label{thm: Regret bound of Private UC-O-REPS}
    Fix any $\pripara>0$ and $K>1$, and set $\FTRLpara = \sqrt{\frac{\ln\rbr{\statesize\actionsize/\horizontotal}}{\episodetotal}},\delta=\frac{\statesize\actionsize}{\episodetotal}$. 
   Under Assumptions \ref{assp: private counts} and %Assumption
    \ref{assp: Private loss in full-information setting}, the expected regret of Private-UC-O-REPS satisfies
    % \begin{equation}\notag
    % \begin{aligned}
    %     \expect\sbr{\regret} \leq \cO\Bigg(& \horizontotal\sum_{\horizon=0}^{\horizontotal-1} \sqrt{\sum_{(\state,\action)\in\statespace_\horizon\times\actionspace} \locsupport_{\state,\action}\episodetotal} \\
    %     &+ \horizontotal\statesize^2\actionsize\confcountxa + \horizontotal\conflossf \Bigg),
    % \end{aligned}
    % \end{equation}
    \begin{equation}\notag
    \begin{aligned}
        \expect\sbr{\regret} \leq \widetilde{\cO}\Big(\horizontotal\cumlocsupport \sqrt{\episodetotal} + \horizontotal\statesize^2\actionsize\confcountxa + \horizontotal\conflossf \Big).
    \end{aligned}
    \end{equation}
\end{theorem}
\begin{proof}
    We decompose the regret into the sum of two terms, $\textsc{Error} = \sum_{\episode=1}^\episodetotal \inner{\occmeasure^{\transeasy, \policy_\episode} - \occmeasure_\episode}{ \loss_\episode}$ and $\textsc{Reg} = \sum_{\episode=1}^\episodetotal \inner{\occmeasure_\episode-\occmeasure^*}{ \loss_\episode}$, and bound them separately.

    % The term $\text{ERROR}$ results from the lack of knowledge about the environment's dynamics.
    % $\text{ERROR}$ results from that the agent selects occupancy measures within the confidence set, which are not exactly occupancy measures of $\transeasy$.
    % It is the difference between the loss of the agent's chosen policies in $\transeasy$ and that in the "optimistic" MDP induced by $\occmeasure_\episode$.
    $\textsc{Error}$ quantifies the cumulative difference between the loss incurred by the agent's chosen policy under the true transition $\transeasy$ and under the ``optimistic'' transition $\transeasy_\episode$ induced by $\occmeasure_\episode$,
    where $\transeasy_\episode = \transeasy^{\occmeasure_\episode} \in \transspace_\episode$, so that $\occmeasure_\episode = \occmeasure^{\transeasy_\episode,\policy_\episode}$ by the definition of $\policy_\episode$ and Eq.~\eqref{eq: occupancy, transition, policy}.
    This term arises because the agent selects occupancy measures within the confidence set, which are not exactly the occupancy measures of $\transeasy$.
    Since all losses are in $[0,1]$, we have $\textsc{Error}\leq\sum_{\episode=1}^\episodetotal\sum_{\state,\action}\abr{\occmeasure^{\transeasy, \policy_\episode}\rbr{\state,\action} - \occmeasure^{\transeasy_\episode, \policy_\episode}\rbr{\state,\action}}$, and with probability at least $1-7\delta$, $\textsc{Error}\leq \widetilde{\cO}\rbr{\horizontotal\cumlocsupport\sqrt{\episodetotal} + \statesize^2\actionsize\horizontotal\confcountxax}$.

    The core idea for controlling $\textsc{Reg}$ is to introduce a pseudo-private algorithm as an intermediate step.
    Instead of injecting identically distributed noise $\noise$ at each episode, as our algorithm does, the pseudo-private algorithm injects noise once, at the very start, i.e., $\losscumpri - \losscum = \widehat{\noiseeasy}\rbr{\state,\action}$ for all $\rbr{\state,\action}$, and then applies the same FTRL update in Eq.~\eqref{eq: update occupancy measure full info} to obtain the pseudo occupancy measure $\occmeasvir_\episode$.
    % \sadegh{Earlier we denoted: $\losscumpri - \losscum = \noiseeasy\rbr{\state,\action}$. Please double check. \sj{here $\widehat{\noiseeasy}\rbr{\state,\action}$ is the noise injected into the pseudo-private algorithm rather than the original algorithm, we bound the \textsc{Reg} term by bounding the regret of the pseudo-private algorithm.}}
    By Assumption \ref{assp: Private loss in full-information setting}, the noise injected in the two algorithms follows the same distribution, so the distribution of $\occmeasure_\episode$ is identical to that of $\occmeasvir_\episode$.
    Therefore, we can bound $\textsc{Reg}$ by bounding the regret of the pseudo algorithm.
    Applying an FTRL analysis similar to the one used in private online learning \citep{agarwal2017price}, we find that the regret bound of the pseudo algorithm consists of three key components: a stability term that constrains the change in $\occmeasvir$ per episode, and two bias terms arising from the regularization and the one-shot noise injection.
    With the help of Assumption \ref{assp: Private loss in full-information setting}, we derive $\expect\sbr{\textsc{Reg}} \leq \cO\rbr{\horizontotal\sqrt{\episodetotal \ln \frac{\statesize\actionsize}{\horizontotal}} + \horizontotal\conflossf}$.
    % which introduces one-shot noise that doesn't perturb the stability term of the FTRL analysis but incurs a cost in the bias term instead.
    % Specifically, the virtual algorithm also applies the FTRL method in Eq.~\eqref{eq: update occupancy measure full info} to update occupancy measures $\occmeasvir_\episode$, but using a one-shot noise injection, at the very start of the algorithm, i.e., $\losscumpri - \losscum = \widehat{\noiseeasy}\rbr{\state,\action}$ for all $\rbr{\state,\action}$, rather than injecting identically distributed noise $\noise$ at each episode.
    % As the noises injected in both algorithms follow the same distribution, we observe that the distribution of $\occmeasure_\episode$ in our algorithm is identical to that of $\occmeasvir_\episode$ in the virtual algorithm.
    % Therefore, we can bound $\textsc{Reg}$ by bounding the regret of the virtual algorithm. 
    % Applying a similar FTRL analysis used in private online learning \citep{agarwal2017price}, the virtual regret bound consists of three key components: a stability term that constrains the change in $\occmeasvir$ per episode, along with two bias terms arising from regularization and the one-shot noise injection.
    % Formally, $\sum_{\episode=1}^\episodetotal \inner{\occmeasvir_\episode-\occmeasure^*}{\loss_\episode} \leq \sum_{\episode=1}^\episodetotal \inner{\occmeasvir_\episode-\occmeasvir_{\episode+1}}{\loss_\episode} + \frac{1}{\eta} \distc_\psi + \distc_{\widehat{\noiseeasy}}$.
    % where $\distc_\psi = \max_{\occmeasure\in\occmeasureset\rbr{\transeasy}} \regularizer\rbr{\occmeasure} - \min_{\occmeasure\in\occmeasureset\rbr{\transeasy}} \regularizer\rbr{\occmeasure}$ and $\distc_{\widehat{\noiseeasy}} = \max_{\occmeasure\in\occmeasureset\rbr{\transeasy}} \inner{\widehat{\noiseeasy}}{\occmeasure} - \min_{\occmeasure\in\occmeasureset\rbr{\transeasy}} \inner{\widehat{\noiseeasy}}{\occmeasure}$.
    % Under Assumption~\ref{assp: private counts}, we derive 
    % $\expect\sbr{\textsc{Reg}} \leq O\rbr{\horizontotal\sqrt{\episodetotal \ln \frac{\statesize\actionsize}{\horizontotal}} + \horizontotal\conflossf}$.
\end{proof}

% \subsubsection{Basic version: $L_1$ norm error bound}
% The following concentration bounds on the private estimates will be the key to our algorithm design.
% The results follow \cite{chowdhury2022differentially}.
% \begin{lemma}[Concentration of private estimates]
% \label{lemma: concentration of transition estimates}
%     Fix any $\pripara>0$ and $\delta\in(0,1]$. Then, under assumption \ref{assp: private counts}, with probability at least $1-2\delta$, uniformly over all $\rbr{\state,\action,\horizon,\episode}$,
%     \begin{equation}
%         \normtranserror_\episode \rbr{\state,\action} \triangleq \nbr{\transeasy_\horizon \rbr{\cdot \vert \state, \action} - \transprieasy_\horizon^\episode \rbr{\cdot \vert \state, \action} }_1 \leq \confnormtrans{\rbr{\state,\action}},
%     \end{equation}
%     where $\confnormtrans{\rbr{\state,\action}} := \frac{\confconstant}{\sqrt{\max\cbr{1,\visitxatotalpri + \confcountxa}}} + \frac{\statesize\confcountxax + 2\confcountxa}{\max\cbr{1,\visitxatotalpri + \confcountxa}}$, and $\confconstant := \sqrt{4\statesize\ln\frac{6\statesize\actionsize\horizontotal\episodetotal}{\delta}}$.
% \end{lemma}

% \subsubsection{Advanced version: point-wise error bound}
% \begin{lemma}[Refined concentration of private estimates]
% \label{lemma: refined concentration of transition estimates}
%     Fix any $\pripara>0$ and $\delta\in(0,1]$. Then, under assumption \ref{assp: private counts}, with probability at least $1-3\delta$, uniformly over all $\rbr{\state,\action,\horizon,\episode}$,
%     \begin{equation}
%         \abr{\transeasy_\horizon \rbr{\state^\prime \vert \state, \action} - \transprieasy_\horizon^\episode \rbr{\state^\prime \vert \state, \action} } 
%         \leq C \sqrt{\frac{\confconstpw \transeasy_\horizon \rbr{\state^\prime \vert \state, \action}}{\max\cbr{1,\visitxatotalpri + \confcountxa}}} + \frac{C\confconstpw + 2\confcountxa + \confcountxax}{\max\cbr{1,\visitxatotalpri + \confcountxa}},
%     \end{equation}
%     where $C>0$ is some constant, and $\confconstpw := \log\frac{6\statesize\actionsize\horizontotal\episodetotal}{\delta}$.
% \end{lemma}





