\section{Privacy and Regret Guarantees}
\label{sec: privacy and regret guarantees}
In this section, we design Privatizers that satisfy the required assumptions for each of the considered
feedback settings (full-information and bandit)
and privacy constraints (JDP or LDP).
% All proofs in this section are deferred to Appendix \ref{appen: privacy guarantee proof}.

\subsection{Achieving JDP using Central Privatizer}
\label{ssec: Achieving JDP using Central Privatizer}
% The Central Privatizer protects the information of all single users by privatizing all the counters streams $\visitxatotal,\visitxaxtotal$ and $\losscum$ using the Binary Mechanism \cite{chan2011private} under full-information setting, and privatizing streams $\loss_\episode\rbr{\state_\horizon^\episode,\action_\horizon^\episode}$ using the Laplace Mechanism \cite{dwork2014algorithmic} under bandit setting.
The Central Privatizer protects the information of all individual users by privatizing all the visitation counters and losses.
Specifically, given privacy budget $\pripara>0$, we construct the Central Privatizer as follows:

$(1)$ For all $\rbr{\state,\action,\state^\prime}$, we privatize $\cbr{\visitxatotal}_{\episode\in\sbr{\episodetotal}}$ and $\cbr{\visitxaxtotal}_{\episode\in\sbr{\episodetotal}}$ by the  Binary Mechanism \citep{chan2011private} with $\pripara^\prime = \frac{\pripara}{3\horizontotal\log\episodetotal}$. 
Denoting the output of the Binary Mechanism by $\visitxatotalbineasy_\episode$, the private counts $\visitxatotalprieasy_\episode$ are obtained by the procedure in Section \ref{ssub-sec: Post-processing steps}. 

$(2)$ Under the full-information setting, for all $\rbr{\state,\action}$, we privatize $\cbr{\losscum}_{\episode\in[\episodetotal]}$  by a variant of the Binary Mechanism 
% (see Algorithm \ref{algo: Private Counter (PC) for losscum}) 
with $\pripara^\prime = \frac{\pripara}{3\horizontotal\log\episodetotal}$ (see Section \ref{ssub-sec: Post-processing steps}).

$(3)$ Under the bandit setting, for all $\rbr{\episode,\state,\action}$, we directly use the Laplace Mechanism \citep{dwork2014algorithmic} with $\pripara^\prime = \frac{\pripara}{3\horizontotal}$, i.e., $\losspri = \loss_\episode(\state,\action)\II_\episode\rbr{\state,\action} + \lap{\frac{3\horizontotal}{\pripara}}$\footnote{Here, we slightly overload notation and use $\lap{\cdot}$ to denote a zero-mean Laplace random variable with parameter $\cdot$.}.
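\begin{remark}
As an informal sanity check of the budget split, consider the following sketch. It assumes that $\pripara^\prime$ is the privacy parameter of the Laplace noise attached to each partial sum of the Binary Mechanism, that each episode enters at most $\log\episodetotal$ partial sums, and that a trajectory visits one state-action pair per layer; the formal statement is Lemma~\ref{lemma: Properties of Central-PRIVATIZER}. Under these assumptions, each of the three families of released statistics (the two visitation counters and the losses) consumes roughly
\[
    \horizontotal \cdot \log\episodetotal \cdot \frac{\pripara}{3\horizontotal\log\episodetotal} = \frac{\pripara}{3}
\]
in the full-information setting. In the bandit setting, the loss family instead consumes $\horizontotal \cdot \frac{\pripara}{3\horizontotal} = \frac{\pripara}{3}$, since each of the $\horizontotal$ losses observed along a trajectory is perturbed once. Summing over the three families gives a total budget of $\pripara$.
\end{remark}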

We summarize the properties of the Central Privatizer in the following lemma.
\begin{lemma}
\label{lemma: Properties of Central-PRIVATIZER}
For any $\pripara\!>\!0$, the Central Privatizer under both full-information and bandit settings is $\pripara$-DP. 
For any $\delta\!\in\!(0,1]$, and $\episodetotal>\sqrt{\statesize\actionsize}$, it satisfies privacy assumptions with $\confcountxa = \cO\rbr{\frac{3\horizontotal}{\pripara}\log^{1.5}\episodetotal \log\iota}$, $\conflossf\!=\!\cO(\frac{3\horizontotal}{\pripara}\sqrt{\log^3\episodetotal \ln\rbr{\statesize\actionsize}})$, and $\ninterval = \frac{3\horizontotal}{\pripara}\log\iota$.
% $(1)$ For visiting counters under both full-information and bandit information setting, Privatizer is with parameter $1/\pripara^\prime = \frac{3\horizontotal\log\episodetotal}{\pripara}$.
% Furthermore, for any $\delta\in(0,1]$, it satisfies assumption \ref{assp: private counts} with $\confcountxa = O\rbr{\frac{3\horizontotal}{\pripara}\log^{1.5}\episodetotal \log\rbr{ \frac{3\statesize^2\actionsize\episodetotal}{\delta}}}$
% $(2)$ For loss counters, under full-information setting, the number of Laplace noise with parameter $\lambda = 1/\pripara^\prime = \frac{3\horizontotal\log\episodetotal}{\pripara}$ injected is $\noisecount = \log \episodetotal$. 
%     Under bandit-information setting, the injected Laplace noise to each suffered loss is $\lap{\frac{3\horizontotal}{\pripara}}$.
\end{lemma}
% Lemma \ref{lemma: Properties of Central-PRIVATIZER} follows the privacy properties of the Binary Mechanism \citep{chan2011private}, and maxima bound of the sum of i.i.d. Laplace random variables (Refer to Lemma \ref{lemma: Maxima of Laplace Variables}).
Using Lemma \ref{lemma: Properties of Central-PRIVATIZER}, as corollaries of Theorem \ref{thm: Regret bound of Private UC-O-REPS} and Theorem \ref{thm: Regret bound of Private UOB-LBPS}, we obtain the regret and privacy guarantees for Private-UC-O-REPS and Private-UOB-LBPS instantiated using the Central Privatizer.
\begin{theorem}[Problem-dependent Regret under JDP]
\label{crl: Regret under JDP}
For any $\pripara>0$, 
if instantiated using the Central Privatizer, Private-UC-O-REPS and Private-UOB-LBPS both satisfy $\pripara$-JDP.
Furthermore, we obtain \\
% $(1)$ $\expect\sbr{\regret^{\text{Full}}} \leq \widetilde{\cO}\rbr{\horizontotal\cumlocsupport \sqrt{\episodetotal} + \frac{\statesize^2\actionsize\horizontotal^2}{\pripara}}$; \\
% $(2)$ $\expect\sbr{\regret^{\text{Bandit}}} \leq \widetilde{\cO}\rbr{\horizontotal\cumlocsupport \sqrt{\episodetotal} + \frac{\actionsize\horizontotal\sqrt{\statesize^3\episodetotal}}{\pripara}}$.
\begin{equation}
\begin{aligned}\notag
    \expect\sbr{\regret^{\text{Full}}} &\leq \widetilde{\cO}\rbr{\horizontotal\cumlocsupport \sqrt{\episodetotal} + \frac{\statesize^2\actionsize\horizontotal^2}{\pripara}}, \\
    \expect\sbr{\regret^{\text{Bandit}}} &\leq \widetilde{\cO}\rbr{\horizontotal\cumlocsupport \sqrt{\episodetotal} + \frac{\actionsize\horizontotal\sqrt{\statesize^3\episodetotal}}{\pripara}}.
\end{aligned}
\end{equation}
\end{theorem}

\begin{remark}
Private-UC-O-REPS and Private-UOB-LBPS with the JDP guarantee improve over the best existing results in the non-private full-information and bandit settings \citep{rosenberg2019onlineamdp,jin20c} by introducing a problem-dependent term, and they match these results in the worst case, i.e., $\widetilde{\cO}\rbr{\horizontotal\statesize\sqrt{\actionsize\episodetotal}}$.
% $(1)$ $\expect\sbr{\regret^{\text{Full}}} \leq \cO\rbr{\horizontotal\statesize\sqrt{\actionsize\episodetotal} + \frac{\statesize^2\actionsize\horizontotal^2}{\pripara}}$; \\
% $(2)$ $\expect\sbr{\regret^{\text{Bandit}}} \leq \cO\rbr{\horizontotal\statesize\sqrt{\actionsize\episodetotal} + \frac{\actionsize\horizontotal\sqrt{\statesize^3\episodetotal}}{\pripara}}$.
\end{remark}
\begin{remark}
Compared to the lower bound for stochastic RL under JDP in~\cite{vietri2020private}, $\Omega\rbr{H\sqrt{XAK} + \frac{XAH\log K}{\epsilon}}$, our bounds have an optimal dependency on the privacy budget $\pripara$.
In terms of $\episodetotal$, the privacy cost in the full-information setting is of lower order than the non-private term; this cost is dominated by the estimation error on the transition function due to the private visitation counters.
However, the privacy cost is sub-optimal in the bandit setting, where the dominant factor is the regret associated with the private loss estimator, given the stronger privacy guarantee for the loss in Lemma~\ref{lemma: Privacy Guarantees of Bandit Losses}.
This gap may be attributed to the inefficiency of our privacy mechanism, but it might also arise from a loose lower bound.
\end{remark}
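\begin{remark}
To make the comparison above concrete, the following back-of-the-envelope calculation ignores logarithmic factors and uses the worst-case non-private term $\horizontotal\statesize\sqrt{\actionsize\episodetotal}$ from the preceding remark; it is an illustration of the bounds in Theorem~\ref{crl: Regret under JDP}, not an additional result. In the full-information setting,
\[
    \frac{\statesize^2\actionsize\horizontotal^2}{\pripara} \leq \horizontotal\statesize\sqrt{\actionsize\episodetotal}
    \quad\Longleftrightarrow\quad
    \episodetotal \geq \frac{\statesize^2\actionsize\horizontotal^2}{\pripara^2},
\]
so the privacy cost is absorbed once $\episodetotal$ exceeds this threshold. In the bandit setting, the ratio of the two terms is
\[
    \frac{\actionsize\horizontotal\sqrt{\statesize^3\episodetotal}/\pripara}{\horizontotal\statesize\sqrt{\actionsize\episodetotal}} = \frac{\sqrt{\statesize\actionsize}}{\pripara},
\]
which does not depend on $\episodetotal$; the privacy cost therefore does not become a lower-order term as $\episodetotal$ grows.
\end{remark}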

\subsubsection{Post-processing steps}
\label{ssub-sec: Post-processing steps}
During the $\episode$-th episode, given the noisy counts $\visitxatotalbin$, $\visitxaxtotalbin$, $\losscumbin$ for all $\rbr{\state,\action,\state^\prime}$ from the classical Binary Mechanism \citep{chan2011private}, we construct the private counters as follows.

\textbf{Private visitation counters.} To satisfy Assumption \ref{assp: private counts}, we use the techniques from \cite{qiao2023near}.
Firstly, we solve the optimization problem\footnote{Note that Problem \eqref{opt: counter post-processing} is a linear program with $\cO(\statesize_{\horizon(\state)+1})$ variables and $\cO(\statesize_{\horizon(\state)+1})$ linear constraints, which can be solved efficiently using existing algorithms for linear programming. A fast implementation could be via the simplex method \citep{ficken2015simplex}.} 
for all $\rbr{\state,\action}$ below.
\begin{equation}
\begin{aligned} 
\label{opt: counter post-processing}
    \min t \,\,\,\,
    \text{s.t.}& \,\, n(\state') \geq 0, \quad \forall \state' \in \statespace_{\horizon(\state)+1},\\
    & \abr{n(\state')  -  \visitxaxtotalbin}  \leq t, \quad \forall \state' \in \statespace_{\horizon(\state)+1},\\
    & \abr{\sum\nolimits_{\state^\prime\in\statespace_{\horizon(\state)+1}} n(\state') - \visitxatotalbin} \leq \frac{\confcountxa}{4}. 
\end{aligned}
\end{equation}
Letting $\visitxaxtotalopt$ denote a minimizer of this problem, %\ST{and} $n^*\!\sas$ \ST{its solution}. 
we define $\visitxatotalopt = \sum_{\state^\prime\in\statespace_{\horizon(\state)+1}} \visitxaxtotalopt$.
By adding an extra term, as done below, we ensure that the private counts $\visitxatotalpri$ never underestimate the respective true counts:
\begin{equation}
\label{eq: counters - adding term to optimization solution}
% \small
\begin{aligned}
    \visitxatotalpri &= \visitxatotalopt + \frac{\confcountxa}{2}, \\
    \visitxaxtotalpri &= \visitxaxtotalopt + \frac{\confcountxa}{2\statesize_{\horizon(\state)+1}}.
\end{aligned}
\end{equation}
The private counts $\visitxatotalprieasy_\episode$ satisfy the following property.
\begin{lemma}
% \small
\label{lemma: counter property of binary mechanism}
Suppose $\visitxatotalbineasy_\horizon^\episode$ satisfy  % it holds that
\begin{equation}
\begin{aligned}
    \abr{\visitxaxtotalbin - \visitxaxtotal} &\leq \frac{\confcountxa}{4}, \\
    \abr{\visitxatotalbin - \visitxatotal } &\leq \frac{\confcountxa}{4},  
\end{aligned}
\end{equation}
for all $\rbr{\horizon,\episode,\state,\action,\state^\prime}$, with probability at least $1-2\delta$. Then, 
$\visitxatotalprieasy_\episode$ derived from Eq.~\eqref{opt: counter post-processing} and Eq.~\eqref{eq: counters - adding term to optimization solution} satisfy Assumption~\ref{assp: private counts}.
\end{lemma}
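\begin{remark}
The reason the shift in Eq.~\eqref{eq: counters - adding term to optimization solution} prevents underestimation can be sketched as follows. Assume, as in Lemma~\ref{lemma: counter property of binary mechanism}, that the error of the Binary Mechanism on the state-action counts is at most $\frac{\confcountxa}{4}$. Combining this with the last constraint of Problem~\eqref{opt: counter post-processing} and the triangle inequality gives
\[
    \abr{\visitxatotalopt - \visitxatotal}
    \leq \abr{\visitxatotalopt - \visitxatotalbin} + \abr{\visitxatotalbin - \visitxatotal}
    \leq \frac{\confcountxa}{4} + \frac{\confcountxa}{4} = \frac{\confcountxa}{2}.
\]
Hence, after adding $\frac{\confcountxa}{2}$,
\[
    \visitxatotal \leq \visitxatotalpri \leq \visitxatotal + \confcountxa,
\]
that is, the private count is never below the true count and overestimates it by at most $\confcountxa$. This is only a sketch for the state-action counts; the complete argument, including the counts indexed by the next state, is the one behind Lemma~\ref{lemma: counter property of binary mechanism}.
\end{remark}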

\textbf{Private loss in full-information setting.} We use a variant of the Binary Mechanism which maintains the same privacy guarantee as the standard Binary Mechanism but has better distributional properties for our problem (see Lemma \ref{lemma: Guarantees of the Variant of Binary Mechanism}). 
That is, for all $\rbr{\episode,\state,\action}$, we post-process $\losscumbin$ by injecting more noise such that the perturbation on $\losscum$ is a summation of $\lceil\log \episodetotal\rceil$ i.i.d.~Laplace variables.
Thus, Assumption \ref{assp: Private loss in full-information setting} follows from a bound on the maximum of sums of i.i.d.~Laplace variables (see Lemma \ref{lemma: Maxima of Laplace Variables}).
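\begin{remark}
As a small illustration of the padding step, assume base-$2$ logarithms and the usual tree layout of the Binary Mechanism, in which a prefix of length $t$ is covered by as many partial sums as there are ones in the binary expansion of $t$ (these are assumptions of this example, not statements taken from the analysis). With $\episodetotal = 8$ and $t = 5 = 101_2$, the standard mechanism adds two Laplace variables, so the variant injects $\lceil\log 8\rceil - 2 = 1$ additional independent Laplace variable, and every prefix ends up perturbed by exactly $\lceil\log\episodetotal\rceil = 3$ of them. Since a Laplace variable with parameter $b$ has variance $2b^2$, the resulting perturbation has variance
\[
    2\lceil\log\episodetotal\rceil\rbr{\frac{3\horizontotal\log\episodetotal}{\pripara}}^2,
\]
i.e., a standard deviation of order $\frac{3\horizontotal}{\pripara}\log^{1.5}\episodetotal$, in line with the $\sqrt{\log^3\episodetotal}$ factor of $\conflossf$ in Lemma~\ref{lemma: Properties of Central-PRIVATIZER}.
\end{remark}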

% During any $\episode$-th episode, the classical binary mechanism adds a minimum number $\rbr{n_{min} \leq \log\episodetotal}$ i.i.d. sampled Laplace noises $\lap{\frac{1}{\pripara^\prime}}$ to the partial sum $\losscum$, where $\pripara^\prime>0$ is a given privacy parameter.
% To make sure all the accumulative losses are added the same amount of noise, we post-process the summation by injecting $\log \episodetotal - n_{min}$ more noise, and output as the private count $\losscumpri$ with $\log \episodetotal$ noises in total injected.

\textbf{Private loss in bandit setting.} Assumption \ref{assp: Private loss in bandit feedback setting} is satisfied by the concentration of Laplace variables \citep{boucheron2003concentration}. 
Moreover, the following lemma for the private loss also holds by the properties of the Laplace Mechanism.
% and Lemma 34 of \cite{hsu2014private}, 
\begin{lemma}
\label{lemma: Privacy Guarantees of Bandit Losses}
     As defined in Section \ref{ssec: Achieving JDP using Central Privatizer}, the sequence $\cbr{\lossprieasy_\episode\rbr{\state,\action}}_{\rbr{\state,\action,\episode}}$ satisfies both $\pripara/3$-DP and $\pripara/3$-LDP.
\end{lemma}

\subsection{Achieving LDP using Local Privatizer}
At each episode $\episode$, the Local Privatizer releases the private counts by perturbing the statistics computed from the trajectory generated in that episode.
Given the privacy budget $\pripara>0$, we construct the Local Privatizer as follows:

$(1)$ For all $(\episode,\state,\action,\state^\prime)$, we perturb the true count $\datas_\episode\!\rbr{\state,\action}\!:=\!\II_\episode\!\rbr{\state,\action}$ by injecting independent Laplace noise: $\dataspri_\episode(\state,\action) = \datas_\episode(\state,\action) + \lap{3\horizontotal/\pripara}$.
Then, the noisy counts are calculated by $\visitxatotalbin = \sum_{i=1}^{\episode-1} \dataspri_i(\state,\action)$. 
The counter $\visitxaxtotalbin$ is obtained in a similar way. 
Finally, applying the post-processing in Section \ref{ssub-sec: Post-processing steps}, we obtain the private counts $\visitxatotalprieasy_\episode$.

$(2)$ Under the full-information setting, for all $\rbr{\episode,\state,\action}$, we perturb the observed loss by adding independent Laplace noise: $\lossprieasy_\episode\rbr{\state,\action} =  \loss_\episode\rbr{\state,\action} + \lap{3\horizontotal/\pripara}$.
The cumulative statistic is calculated by $\losscumpri = \sum_{i=1}^{\episode-1} \lossprieasy_i\rbr{\state,\action}$.

$(3)$ Under the bandit setting, we apply the same mechanism as in Section \ref{ssec: Achieving JDP using Central Privatizer}, whose LDP guarantee follows from Lemma \ref{lemma: Privacy Guarantees of Bandit Losses}.

% Let us discuss how private counts for the number of visited states are computed.
% At each episode $j$, give privacy parameter $\pripara^\prime \geq 0$, Local-PRIVATIZER perturbs $\datas_\horizon^j\rbr{\state,\action}$ with an independent Laplace noise $Lap\rbr{\frac{1}{\pripara^\prime}}$, i.e., it makes $\statesize\actionsize\episodetotal$ noisy perturbations in total.
% The private counts for the $\episode$-th episode are computed as $\visitxatotalpri=\sum_{j=1}^{\episode-1} \dataspri_\horizon^j\rbr{\state,\action}$, where $\dataspri_\horizon^j\rbr{\state,\action}$ denotes the noisy perturbations.
% And the $\visitxaxtotal$ is also computed similarly.
% Then the private counts $\widetilde{N}_\horizon^\episode$ are solved by the same procedure in Section \ref{ssub-sec: Post-processing steps}.

% For the loss privatization process, under full-information setting, the private counts corresponding to empirical cumulative losses $\losscum$ are computed with visiting counters similarly.
% Under bandit-information setting, the private loss $\losspri_\episode\rbr{\state, \action}$ corresponding to losses of each step is generated as same as the JDP section, i.e., adding Laplace noise straightly.
The properties of the Local Privatizer are as follows.
\begin{lemma}
\label{lemma: Properties of Local-PRIVATIZER}
For any $\pripara\!>\!0$, the Local Privatizer under both full-information and bandit settings is $\pripara$-LDP. 
For any $\delta\!\in\!(0,1]$ and $\episodetotal>\ln\rbr{\statesize\actionsize}/2$, it satisfies the privacy assumptions with $\confcountxa = \cO\rbr{\frac{3\horizontotal}{\pripara}\sqrt{\episodetotal\log\iota}}$, $\conflossf = \cO\rbr{\frac{3\horizontotal}{\pripara}\sqrt{\episodetotal\ln\rbr{\statesize\actionsize}}}$, and $\ninterval = \frac{3\horizontotal}{\pripara}\log\iota$.
\end{lemma}
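\begin{remark}
The $\sqrt{\episodetotal}$ scaling in Lemma~\ref{lemma: Properties of Local-PRIVATIZER} can be anticipated from a simple variance calculation, given here as a heuristic rather than as part of the proof. The noisy count $\visitxatotalbin$ of the Local Privatizer accumulates $\episode-1$ independent Laplace variables with parameter $3\horizontotal/\pripara$, each of variance $2\rbr{3\horizontotal/\pripara}^2$, so its perturbation has standard deviation
\[
    \frac{3\horizontotal}{\pripara}\sqrt{2\rbr{\episode-1}} \leq \frac{3\horizontotal}{\pripara}\sqrt{2\episodetotal}.
\]
By contrast, the Central Privatizer perturbs each count with only polylogarithmically many Laplace variables, which is why $\confcountxa$ and $\conflossf$ in Lemma~\ref{lemma: Properties of Central-PRIVATIZER} grow as $\log^{1.5}\episodetotal$ instead of $\sqrt{\episodetotal}$.
\end{remark}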


% \begin{remark}
%     The noise level in the private visitation counts is $O(\log\episodetotal)$ under JDP and $O(\sqrt{\episodetotal})$ under LDP, which incurs the differences on the second term in the bound for JDP and LDP.
% \end{remark}
Combining Lemma \ref{lemma: Properties of Local-PRIVATIZER}, Theorem \ref{thm: Regret bound of Private UC-O-REPS}, and Theorem \ref{thm: Regret bound of Private UOB-LBPS}, we obtain the following regret bound:
\begin{theorem}[Problem-dependent Regret under LDP]
\label{crl: Regret under LDP}
For any $\pripara>0$, 
if instantiated using the Local Privatizer, Private-UC-O-REPS and Private-UOB-LBPS both satisfy $\pripara$-LDP.
Furthermore, we obtain the expected regret,\\
% $(1)$ $\expect\sbr{\regret^{\text{Full}}} \leq \widetilde{\cO}\rbr{\horizontotal\cumlocsupport \sqrt{\episodetotal} + \frac{\statesize^2\actionsize\horizontotal^2\sqrt{\episodetotal}}{\pripara}}$; \\
% $(2)$ $\expect\sbr{\regret^{\text{Bandit}}} \leq \widetilde{\cO}\rbr{\horizontotal\cumlocsupport \sqrt{\episodetotal} + \frac{\statesize^4\actionsize\horizontotal^2\sqrt{\episodetotal}}{\pripara}}$.
\begin{equation}
\begin{aligned}\notag
    \expect\sbr{\regret^{\text{Full}}} &\leq \widetilde{\cO}\rbr{\horizontotal\cumlocsupport \sqrt{\episodetotal} + \frac{\statesize^2\actionsize\horizontotal^2\sqrt{\episodetotal}}{\pripara}}, \\
    \expect\sbr{\regret^{\text{Bandit}}} &\leq \widetilde{\cO}\rbr{\horizontotal\cumlocsupport \sqrt{\episodetotal} + \frac{\statesize^4\actionsize\horizontotal^2\sqrt{\episodetotal}}{\pripara}}.
\end{aligned}
\end{equation}

\end{theorem}

\begin{remark}
Similar to the JDP case, Private-UC-O-REPS and Private-UOB-LBPS with the LDP guarantee also enjoy problem-dependent regret bounds and, in the worst case, match the best regret bounds in the non-private full-information and bandit settings \citep{rosenberg2019onlineamdp,jin20c}.
\end{remark}

\begin{remark}
In the case of LDP, \cite{garcelon2021local} implies a lower bound of $\Omega\rbr{\frac{H\sqrt{XAK}}{\pripara}}$ on the privacy-related term for stochastic episodic RL, assuming a small enough $\epsilon$ (corresponding to the high-privacy regime). 
Our bounds also have an optimal dependency on the privacy budget $\pripara$ and the number of episodes $\episodetotal$, but a worse dependency on the size of the state space. 
In the full-information setting, this gap is mainly due to the $L_1$-norm estimation error on the transition function. 
In comparison, in the bandit setting, the main factor is the bias between the upper occupancy measure and the true occupancy measure, influenced by our component-wise confidence set.
\end{remark}
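\begin{remark}
To quantify the gap discussed above, one can divide the privacy-related terms of Theorem~\ref{crl: Regret under LDP} by the lower bound, reading $H$, $X$, $A$, $K$ in the lower bound as $\horizontotal$, $\statesize$, $\actionsize$, $\episodetotal$ and ignoring logarithmic factors. This is plain arithmetic on the stated bounds, not an additional result:
\[
    \frac{\statesize^2\actionsize\horizontotal^2\sqrt{\episodetotal}/\pripara}{\horizontotal\sqrt{\statesize\actionsize\episodetotal}/\pripara} = \horizontotal\statesize^{3/2}\actionsize^{1/2},
    \qquad
    \frac{\statesize^4\actionsize\horizontotal^2\sqrt{\episodetotal}/\pripara}{\horizontotal\sqrt{\statesize\actionsize\episodetotal}/\pripara} = \horizontotal\statesize^{7/2}\actionsize^{1/2}.
\]
Both ratios are independent of $\pripara$ and $\episodetotal$, consistent with the optimal dependency on these two quantities, while the powers of $\statesize$ make explicit the excess dependency on the size of the state space in the full-information and bandit settings, respectively.
\end{remark}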

\subsection{Further discussions}
Our Privatizer for the visitation counters in Assumption \ref{assp: private counts} is the same as in previous work~\citep{qiao2023near}, but the motivation is different.
In our setting, we apply the post-processing step for $\visitxatotalbineasy_\episode$ to ensure that $\transprieasy_\episode$ is a valid probability distribution so that we can construct a valid occupancy measure space for online optimization (Eq. \eqref{eq: update occupancy measure full info} and Eq. \eqref{eq: update occupancy, bandit-info}).
Meanwhile, the novel Assumption \ref{assp: Private loss in full-information setting} for private cumulative loss $\losscumprieasy_\episode$ helps separate the impact of noise on regret for online optimization. 
Assumption \ref{assp: Private loss in bandit feedback setting} for private loss estimators also bridges the privacy protection between DP and LDP and plays a vital role in the regret minimization procedure.

The Laplace noise used in our Privatizer can also be replaced with other noise distributions, such as Gaussian noise \citep{dwork2014algorithmic}.
According to Theorem \ref{thm: Regret bound of Private UC-O-REPS} and Theorem \ref{thm: Regret bound of Private UOB-LBPS}, the corresponding regret bounds can be derived by plugging in the resulting precision levels $\confcountxa$, $\conflossf$, and $\ninterval$.
