\section{Bandit Setting} \label{sec: bandit-feedback setting}
In this section, we study private adversarial RL under bandit feedback.
We propose the \textit{Private Upper Occupancy Bound Log-Barrier Policy Search} (Private UOB-LBPS) framework,
based on the non-private version in \cite{jin20c}.
Our algorithm differs from that version in three main ways.
First, we use the new confidence set of the transition function defined in Eq.~\eqref{eq: confidence set of transition}, which is strictly tighter and helps achieve a problem-dependent regret bound.
Second, we introduce a novel private loss estimator that retains three useful properties: optimistic estimation, bounded perturbation, and non-negativity.
Third, we use a log-barrier regularizer to update the occupancy measures, which yields a tighter stability term.

We provide a brief description of the algorithm, deferring the full pseudo-code to the appendix.
% This novel loss estimator is critical to our regret bound, i.e., a "\textsc{Bias}" term caused by the biased loss estimation, and a "\textsc{Reg}" term with respect to the bounded loss estimation. 
% In each episode $\episode$, we firstly privatize the observed loss with the privacy mechanism, which satisfies Assumption \ref{assp: Private loss in bandit feedback setting}, and obtain the private loss $\losspri$ for all $\rbr{\state,\action}$.
In each episode $\episode$, we obtain the private loss $\losspri$ for all $\rbr{\state,\action}$ by privatizing the observed loss with the privacy mechanism; the resulting private loss may be unbounded and negative.
To make the loss function bounded, we require that the perturbations on the observed loss do not exceed a specific threshold $\ninterval$ with high probability, as formally specified in Assumption \ref{assp: Private loss in bandit feedback setting}.
Then, we scale the private loss to $[0,1]$ to obtain
\begin{equation}
\label{eq: loss rescale}
    \ddot{\loss}_\episode\rbr{\state,\action} = \frac{\lossprieasy_\episode\rbr{\state,\action} + \ninterval}{2\ninterval + 1},
\end{equation}
and then construct an optimistic loss estimator $\lossest$ using the (efficiently computable) \emph{upper occupancy bound} $\uppocc_\episode$, similar to \cite{jin20c}:
\begin{equation}
\label{eq: loss estimator}
    \lossest = \frac{\ddot{\loss}_\episode\rbr{\state,\action}}{\uppocc_\episode\rbr{\state,\action}},
\end{equation}
where $\uppocc_\episode(\state,\action) = \max_{\transeasy\in\transspace_{\episode}} \occmeasure^{\transeasy,\policy_\episode}(\state,\action)$.
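
As a quick sanity check of the rescaling in Eq.~\eqref{eq: loss rescale}, assume (as is standard, and taken for granted here) that the true losses lie in $[0,1]$, and condition on the event in Assumption \ref{assp: Private loss in bandit feedback setting} that every perturbation has magnitude at most $\ninterval$.
Then $\lossprieasy_\episode\rbr{\state,\action}\in\sbr{-\ninterval,1+\ninterval}$, so that
\begin{equation}\notag
    0 = \frac{-\ninterval+\ninterval}{2\ninterval+1} \leq \ddot{\loss}_\episode\rbr{\state,\action} \leq \frac{\rbr{1+\ninterval}+\ninterval}{2\ninterval+1} = 1.
\end{equation}
For instance, with the purely illustrative values $\ninterval=2$ and a private loss of $-1.5$, the rescaled loss is $(-1.5+2)/5=0.1$.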

% Clearly, it is a biased estimator for $\lossprieasy_\episode^\prime\rbr{\state,\action}$, since $\transeasy_\episode$ is not exactly the true transition.
Next, we construct the confidence set $\transspace_{\episode+1}$ in the same way as in the full-information setting (Eq.~\eqref{eq: confidence set of transition}).
% The algorithm privatizes the visiting counters $\visitxatotalpri,\visitxaxtotalpri$ following the Assumption \ref{assp: private counts} in the full-information setting. 
% Regarding the bandit loss $\losspri$, we straightly add the requisite magnitude of Laplace noise $\lap{\lambda}$ to ensure (local) differential privacy.
% To ensure the noisy losses are bound, we pick a threshold $\ninterval$ such that w.h.p., the noise injected will be inside the interval $\sbr{-\ninterval,\ninterval}$.
% Only in episodes where the added noises fall inside such intervals, we first scale the noisy losses back to $[0,1]$ and then update policies similar to the non-private setting (Bounded Bandit UC-O-REPS).
% Specially, at each episode $\episode$, using the private counts and losses, it first computes private transition estimates, confidence set, and the loss estimates.
% To this end, due to technical reasons, we assume that the optimal occupancy measure satisfies the \emph{$\alpha$-reachability assumption} \citep{neu2010online}, that is, its visitation probability to any state  is larger than some $\alpha>0$.\footnote{This assumption is widely adopted in the literature \citep{neu2010online,rosenberg2019onlinessp}} Letting $\occmeasureset_\alpha\!\subset\!\occmeasureset$ denote the set of all valid occupancy measures satisfying $\alpha$-reachability under any policy, we verify: $\occmeasure^*\!=\! \argmin_{\occmeasure\in\occmeasureset_\alpha\rbr{\transeasy}} \sum_{\episode=1}^\episodetotal\inner{\occmeasure}{\loss_\episode}$. 
Finally, we find $\occmeasure_{\episode+1}$ via Online Mirror Descent (OMD):
% restricting it to lie in $\occmeasureset_\alpha$:
\begin{equation}
\label{eq: update occupancy, bandit-info}
    \occmeasure_{\episode+1} = \argmin_{\occmeasure \in \occmeasureset\rbr{\transspace_{\episode+1}}} \inner{\lossesteasy_{\episode}}{\occmeasure} + \frac{1}{\FTRLpara}\divg{\occmeasure}{\occmeasure_{\episode}},
\end{equation}
where $\divgeasy$ is the Bregman divergence of a log-barrier regularizer  $\regularizer$, which leads to a better stability term in the analysis, 
\begin{align}\label{eq: regularizer in bandit setting}
\regularizer(\occmeasure) = \sum_{\horizon=0}^{\horizontotal-1}\sum_{\state\in\statespace_{\horizon}}\sum_{\action\in\actionspace}\sum_{\state'\in\statespace_{\horizon+1}} \log\frac{1}{\occmeasure\rbr{\state,\action,\state'}}.
\end{align}
%We assume that the visit probability to any state under any policy is larger than $\alpha>0$ to ensure exploration, and denote $\occmeasureset_\alpha$ as the set of valid occupancy measures satisfying the assumption
%Use of $\occmeasureset_\alpha$, in lieu of $\occmeasureset$, is to ensure exploration\footnote{ }.
Note that this optimization problem can be solved efficiently via, e.g., Algorithm 4 in \cite{lee2020bias}.
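
For concreteness, the Bregman divergence induced by the regularizer in Eq.~\eqref{eq: regularizer in bandit setting} can be written out explicitly.
Applying the generic definition $\divg{\occmeasure}{\occmeasure'} = \regularizer(\occmeasure) - \regularizer(\occmeasure') - \inner{\nabla\regularizer(\occmeasure')}{\occmeasure-\occmeasure'}$ coordinate-wise to $x\mapsto\log\frac{1}{x}$ gives, for any two occupancy measures $\occmeasure,\occmeasure'$ with strictly positive entries,
\begin{equation}\notag
    \divg{\occmeasure}{\occmeasure'} = \sum_{\horizon=0}^{\horizontotal-1}\sum_{\state\in\statespace_{\horizon}}\sum_{\action\in\actionspace}\sum_{\state'\in\statespace_{\horizon+1}} \rbr{\frac{\occmeasure\rbr{\state,\action,\state'}}{\occmeasure'\rbr{\state,\action,\state'}} - 1 - \log\frac{\occmeasure\rbr{\state,\action,\state'}}{\occmeasure'\rbr{\state,\action,\state'}}}.
\end{equation}
Each summand is non-negative and vanishes only when the two corresponding entries coincide.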

Formally, the Privatizer for the loss function should satisfy the following assumption.
\begin{assumption}[Private loss in bandit feedback setting] \label{assp: Private loss in bandit feedback setting}
For all $\rbr{\state,\action,\episode}$, $\noise:= \losspri - \loss_\episode\rbr{\state,\action}\II_\episode(\state,\action)$ are i.i.d.~zero-mean random variables; and for some $\ninterval >0$, with probability at least $1-\delta$ uniformly over all $\rbr{\state,\action,\episode}$, $\abr{\noise} \leq \ninterval$.
\end{assumption}
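
As one illustration of how the threshold $\ninterval$ may be chosen, suppose that each $\noise$ is drawn independently from a centered Laplace distribution with scale parameter $\lambda>0$.
This choice of mechanism and the symbol $\lambda$ are assumptions made here only for concreteness; the assumption above does not prescribe a particular mechanism.
Such noise is zero-mean and satisfies $\Pr\sbr{\abr{\noise}>t}=e^{-t/\lambda}$ for every $t\geq 0$, so a union bound over the at most $\statesize\actionsize\episodetotal$ triples $\rbr{\state,\action,\episode}$ gives
\begin{equation}\notag
    \Pr\sbr{\exists \rbr{\state,\action,\episode}:\ \abr{\noise}>t} \leq \statesize\actionsize\episodetotal\, e^{-t/\lambda},
\end{equation}
which is at most $\delta$ for $t=\lambda\log\frac{\statesize\actionsize\episodetotal}{\delta}$.
Under these assumptions, taking $\ninterval=\lambda\log\frac{\statesize\actionsize\episodetotal}{\delta}$ would suffice.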

When instantiated with any Privatizer satisfying Assumption \ref{assp: private counts} and Assumption \ref{assp: Private loss in bandit feedback setting}, Private UOB-LBPS enjoys the following general regret bound.
\begin{theorem}
\label{thm: Regret bound of Private UOB-LBPS}
Fix any $\pripara>0$ and set $\FTRLpara = \sqrt{\frac{\statesize}{\episodetotal}},\delta=\frac{\statesize\actionsize}{\episodetotal}$. Then, under Assumption \ref{assp: private counts} and Assumption \ref{assp: Private loss in bandit feedback setting}, the regret of Private UOB-LBPS satisfies
    % \begin{equation}\notag
    % \begin{aligned}
    %     \expect\sbr{\regret} \leq
    %     \cO\Bigg( & \horizontotal \sum_{\horizon=0}^{\horizontotal-1} \sqrt{  \sum_{(\state,\action)\in\statespace_\horizon\times\actionspace} \locsupport_{\state,\action}\episodetotal} \\
    %     & + \horizontotal\statesize^4\actionsize\confcountxa + \horizontotal\ninterval\sqrt{\episodetotal} \Bigg)
    % \end{aligned}
    % \end{equation}
    \begin{equation}\notag
    \begin{aligned}
        \expect\sbr{\regret} \leq
        \widetilde{\cO}\Bigg(\horizontotal\cumlocsupport \sqrt{\episodetotal} + \horizontotal\statesize^4\actionsize\confcountxa + \actionsize\ninterval\sqrt{\statesize^3\episodetotal}\Bigg).
    \end{aligned}
    \end{equation}
    
\end{theorem}
\begin{proof}
     We decompose the regret into the sum of three terms: $\textsc{Error} \!= \sum_{\episode=1}^\episodetotal \langle\occmeasure^{\transeasy,\policy_\episode} \!- \occmeasure^{\transeasy_\episode,\policy_\episode},\loss_\episode\rangle$, $\textsc{Bias} = \sum_{\episode=1}^\episodetotal \langle\occmeasure^{\transeasy_\episode,\policy_\episode} - \occmeasure^*, \loss_\episode - \lossesteasy_\episode \rangle$, and $\textsc{Reg} = \sum_{\episode=1}^\episodetotal \langle\occmeasure^{\transeasy_\episode,\policy_\episode} - \occmeasure^*,\lossesteasy_\episode\rangle$.

    To bound $\textsc{Error}$, we directly borrow the analysis in Theorem \ref{thm: Regret bound of Private UC-O-REPS}.
    To handle the bias caused by the scaling step, we define the intermediate variable $g_\episode\rbr{\state,\action}=\frac{\loss_\episode\rbr{\state,\action}}{2\ninterval+1}$,
    which yields the following decomposition:
    \begin{align*}
    \textsc{Bias} &= \sum_{\episode=1}^\episodetotal \langle\occmeasure^{\transeasy_\episode,\policy_\episode} - \occmeasure^*, \loss_\episode - g_\episode\rangle + \sum_{\episode=1}^\episodetotal \langle\occmeasure^{\transeasy_\episode,\policy_\episode},g_\episode - \lossesteasy_\episode\rangle \\
    &+ \sum_{\episode=1}^\episodetotal \langle\occmeasure^*,\lossesteasy_\episode -g_\episode\rangle.
    \end{align*}
    The first term is bounded by $\regret-\textsc{Error}$ by basic decomposition.
    Using the upper occupancy bound and the intermediate variable, the second term mainly depends on $\sum_{\episode=1}^\episodetotal \sum_\state \vert \uppocc_\episode(\state) - \occmeasure^{\transeasy,\policy_\episode}(\state)\vert$, which can be controlled using our confidence set in Eq.~\eqref{eq: confidence set of transition}.
    Finally, the third term is non-positive by the definition of our biased loss and the upper occupancy bound.
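
    To make the step for the first term explicit, note that $\loss_\episode - g_\episode = \frac{2\ninterval}{2\ninterval+1}\loss_\episode$ by the definition of $g_\episode$.
    Assuming $\regret = \sum_{\episode=1}^\episodetotal \langle\occmeasure^{\transeasy,\policy_\episode} - \occmeasure^*,\loss_\episode\rangle$ (the usual definition, restated here as an assumption), we have $\regret - \textsc{Error} = \sum_{\episode=1}^\episodetotal \langle\occmeasure^{\transeasy_\episode,\policy_\episode} - \occmeasure^*,\loss_\episode\rangle$, and hence
    \begin{equation}\notag
        \sum_{\episode=1}^\episodetotal \langle\occmeasure^{\transeasy_\episode,\policy_\episode} - \occmeasure^*, \loss_\episode - g_\episode\rangle = \frac{2\ninterval}{2\ninterval+1}\rbr{\regret - \textsc{Error}}.
    \end{equation}
    In other words, under this reading the first term is a fixed fraction, strictly smaller than one, of $\regret-\textsc{Error}$.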
    
    % Notice that the estimated loss is a biased estimator due to the inaccurate transition estimate and scaling process, i.e., $\expect\sbr{\lossesteasy_{\episode}\rbr{\state,\action} \vert \traj_{1:\episode-1}} 
    % =  \frac{\occmeasure^{\transeasy,\policy_\episode}\rbr{\state}}{\occmeasure^{\transeasy_\episode,\policy_\episode}\rbr{\state}} \cdot \frac{\loss_\episode\rbr{\state,\action} + \ninterval}{2\ninterval+1}$.
    % Based on the $\alpha-$reachability assumption, and a careful analysis of the influence of such biased estimates on $\textsc{Bias}$, we have $\abr{\expect\sbr{\textsc{Bias}}} \leq \frac{2\ninterval}{2\ninterval+1} \cdot \expect\sbr{{\regret}} + \rbr{\frac{\ninterval+1}{\alpha\cdot(2\ninterval+1)} - \frac{2\ninterval}{2\ninterval+1}} \cdot \expect\sbr{\textsc{Error}}.$
    For the $\textsc{Reg}$ term, our non-negative loss estimator and the log-barrier regularizer yield a smaller ``stability'' term than the negative-entropy regularizer would, of the form $\EE[\sum_{\episode}\sum_{\state,\action}\occmeasure_\episode(\state,\action)^2\lossesteasy_\episode^2(\state,\action)]$.
    The bound then follows from the standard analysis in \cite{agarwal2017corralling}.
\end{proof}