\documentclass[accepted]{uai2022}
% \documentclass[accepted]{uai2022} % after acceptance, for a revised
                                    % version; also before submission to
                                    % see how the non-anonymous paper
                                    % would look like
%% There is a class option to choose the math font
% \documentclass[mathfont=ptmx]{uai2022} % ptmx math instead of Computer
                                         % Modern (has noticable issues)
% \documentclass[mathfont=newtx]{uai2022} % newtx fonts (improves upon
                                          % ptmx; less tested, no support)
% NOTE: Only keep *one* line above as appropriate, as it will be replaced
%       automatically for papers to be published. Do not make any other
%       change above this note for an accepted version.

%% Choose your variant of English; be consistent
\usepackage[american]{babel}
% \usepackage[british]{babel}

%% Some suggested packages, as needed:
\usepackage{natbib} % has a nice set of citation styles and commands
    \bibliographystyle{plainnat}
    \renewcommand{\bibsection}{\subsubsection*{References}}
\usepackage{mathtools} % amsmath with fixes and additions
% \usepackage{siunitx} % for proper typesetting of numbers and units
\usepackage{booktabs} % commands to create good-looking tables
\usepackage{tikz} % nice language for creating drawings and diagrams
\usepackage{amsmath}
\usepackage{amssymb}
\usepackage{graphicx}
\usepackage{amsthm}
\usepackage[normalem]{ulem}
\usepackage{amsfonts}
\usepackage{bm}
\usepackage{xcolor}



\usepackage{algorithm}
\usepackage{algorithmic}
\usepackage{caption}

\newtheorem{theorem}{Theorem}
\newtheorem{lemma}{Lemma}

\newtheorem{definition}{Definition}

%\newtheorem{lemma}[lemma]{Lemma}
%\newtheorem{proposition}[theorem]{Proposition}
%\newtheorem{remark}[theorem]{Remark}
 \newtheorem{fact}[lemma]{Fact}
% \newtheorem{definition}[definition]{Definition}
% \newtheorem{corollary}[lemma]{Corollary}
%% Provided macros
% \smaller: Because the class footnote size is essentially LaTeX's \small,
%           redefining \footnotesize, we provide the original \footnotesize
%           using this macro.
%           (Use only sparingly, e.g., in drawings, as it is quite small.)

%% Self-defined macros
\newcommand{\swap}[3][-]{#3#1#2} % just an example

\title{Near-Optimal Thompson Sampling-based Algorithms for Differentially Private Stochastic Bandits}

% The standard author block has changed for UAI 2022 to provide
% more space for long author lists and allow for complex affiliations
%
% All author information is automatically removed by the class for the
% anonymous submission version of your paper, so you can already add your
% information below.
%
% Add authors
\author[1,2]{\href{mailto:<bingsha1@ualberta.ca>?Subject=Your UAI 2022 paper}{Bingshan Hu}{}}
\author[1,2]{Nidhi Hegde}
%\author[1,2]{Further~Coauthor}
%\author[3]{Further~Coauthor}
%\author[1]{Further~Coauthor}
%\author[3]{Further~Coauthor}
%\author[3,1]{Further~Coauthor}
% Add affiliations after the authors
\affil[1]{%
    Department of Computing Science\\
    University of Alberta\\
    Edmonton, Alberta, Canada
}
\affil[2]{%
    Amii (Alberta Machine Intelligence Institute)
    % Address\\
    %…
}
% \affil[3]{%
%     Another Affiliation\\
%     Address\\
%     …
%   }
  \begin{document}
\maketitle

\begin{abstract}



  We address differentially private stochastic bandits. We
  present two (near)-optimal  Thompson Sampling-based learning algorithms: DP-TS and Lazy-DP-TS. The core idea in achieving optimality  is  the principle of optimism in the face of uncertainty. We reshape the posterior distribution in an optimistic way as compared to the  non-private Thompson Sampling. Our DP-TS achieves a $\sum\limits_{j \in \mathcal{A}: \Delta_j > 0} O \left(\frac{\log(T)}{\min \left\{\epsilon, \Delta_j \right\}} \log \left(\frac{\log(T)}{\epsilon \cdot \Delta_j} \right) \right)$ regret bound, where $\mathcal{A}$ is the arm set, $\Delta_j$ is the sub-optimality gap of a sub-optimal arm $j$, and $\epsilon$ is the  privacy parameter. 
  Our Lazy-DP-TS gets rid of the extra $\log$ factor by using the idea of dropping observations. The regret of Lazy-DP-TS  is  $ \sum\limits_{j \in \mathcal{A}: \Delta_j > 0} O \left(\frac{\log(T)}{\min \left\{\epsilon, \Delta_j \right\}} \right)$, which matches the  regret lower bound. Additionally, we conduct experiments to compare the empirical performance of our proposed  algorithms with the existing optimal  algorithms for differentially private stochastic bandits.

\end{abstract}

% \vspace{-1\baselineskip}
 
\section{Introduction}\label{sec: intro}

We consider the setting of differentially private stochastic multi-armed bandits~\citep{mishra2015nearly,tossou2016algorithms,shariff2018differentially,sajed2019optimal,hu2021optimal}. In the classical stochastic multi-armed bandit problem, we have a fixed and finite set of $K$ arms
and a stochastic environment. In each round $t = 1,2, \dotsc, T$, the environment generates a random reward $X_j(t)$ for arm $j$ which is revealed and collected if arm $j$ is pulled in that round.  %vector $\left(X_1(t), X_2(t), \dotsc, X_K(t) \right) $ which is hidden to the learning algorithm. Simultaneously, the learning algorithm pulls an arm $J_t \in [K]$. 
For each arm $j$, the rewards $X_j(t) \in [0,1]$  are i.i.d. over time according to a fixed but unknown probability distribution with mean $\mu_j$.  %At the end of  round $t$, the environment reveals the random reward of the pulled arm, i.e., the learning algorithm observes and obtains a reward $X_{J_t}(t)$. 
The goal of the learning algorithm  is to pull arms sequentially to maximize the accumulated reward.  The performance metric has traditionally been pseudo-\emph{regret} \citep{bubeck2012regret}, which measures the gap between the expected accumulated reward of a given benchmark and that of the learning algorithm.

In the classical setting, the learning algorithm uses the true revealed rewards from previous rounds to make decisions on arms in future rounds.  However, in many settings rewards may be private information that should be protected.  For instance, consider an online search advertisement system where the objective is to display relevant ads for web queries. In such a setting the system would display a few advertisements to the user.  When the user clicks on an ad, a reward is collected by the system, which in accumulation would allow the system and any external observers to learn user preferences.  Rewards that represent user preferences are private information and may further allow inference on the user's other private characteristics.  
 


Motivated by such applications, previous work \citep{mishra2015nearly,tossou2016algorithms,sajed2019optimal,hu2021optimal} has studied the
design of bandit learning algorithms with differential privacy~\citep{dwork2014algorithmic} to keep reward information private.  Differential privacy has been used as a framework because it provides robust privacy guarantees and a controlled tradeoff with regret guarantees in the case of bandit learning.


Two of the most common algorithms in the stochastic bandit setting are Upper Confidence Bound (UCB) sampling~\citep{auer2002finite} and Thompson Sampling~\citep{agrawal2017near}.   \citet{mishra2015nearly} present the first differentially private versions of these two algorithms and provide regret bounds for their  algorithms. %However, neither of the derived regret bounds match the regret lower bound that later appears in \citep{shariff2018differentially}.  %% The non-matching of regret bounds is explained in related work, so no need to include it here in the intro.
This was followed by differentially private algorithms %with improved regret bounds 
in the contextual linear bandit setting~\citep{shariff2018differentially} and  optimal differentially private algorithms based on Successive Elimination (DP-SE)~\citep{sajed2019optimal} and UCB (Anytime-Lazy-UCB)~\citep{hu2021optimal} in the stochastic bandit setting.

 
However,  to the best of our knowledge, there is still no optimal Thompson Sampling-based algorithm for differentially private stochastic bandits.  Thompson Sampling-based algorithms often exhibit better empirical performance than UCB- or SE-based methods, are applicable to a wider range of information models, and are more widely implemented in practical scenarios~\citep{chapelle2011empirical,gopalan14TS}.   Given their widespread implementation in practice, we  provide two (near)-optimal Thompson Sampling-based algorithms for  private stochastic bandits.  The core concept in our  algorithms still relies on the principle of optimism in the face of uncertainty. More specifically, we shift the posterior distribution of an arm to the right  as compared to the posterior distribution in the non-private Thompson Sampling.




Our first algorithm, Differentially Private Thompson Sampling (DP-TS), can be viewed as a differentially private version of the standard Thompson Sampling for Bernoulli bandits% that has been presented in
~\citep{agrawal2017near}: the learning algorithm makes a decision based on all observations obtained from the beginning and updates the statistics of the pulled arm at the end of each round.  The regret bound of DP-TS is  $\widetilde{O} \left(\frac{K\log(T)}{\min \left\{\epsilon, \Delta \right\}} \right)$, where $\widetilde{O}(\cdot)$ hides an extra $\log (\log(T)/(\epsilon\Delta))$ factor. 
Our second algorithm, Lazy Differentially Private Thompson Sampling (Lazy-DP-TS), drops observations during learning and updates the statistics of the pulled arm in a delayed manner.  With these modifications we
achieve the optimal $O \left(\frac{K\log(T)}{\min \left\{\Delta, \epsilon \right\}}  \right)$ regret bound.  Interestingly, as discussed in Section~\ref{sec: dp-ts and lazy-dp-ts} and confirmed in Section~\ref{sec: experiments} with numerical experiments, DP-TS may perform better than Lazy-DP-TS under some circumstances.


 



{\bf{Contribution}.} We make the following key contributions. (1) We present (near)-optimal Thompson Sampling-based learning algorithms for differentially private stochastic bandits: DP-TS and Lazy-DP-TS; (2) The regret bound for DP-TS is $  \sum\limits_{j \in \mathcal{A}: \Delta_j > 0}
    O \left( \max \left\{ \frac{\log(T)}{\Delta_j},  \frac{\log(T)}{\epsilon} \log \left(\frac{\log(T)}{\epsilon \cdot \Delta_j} \right)\right\} \right) $, which is optimal up to a $\log \log (T)$ factor (Theorem~\ref{DP-TS regret theorem}); (3) The regret bound for Lazy-DP-TS is $\sum\limits_{j \in \mathcal{A}: \Delta_j >0}O \left(\frac{\log(T)}{\min \left\{\epsilon, \Delta_j \right\}} \right)$, which is optimal (Theorem~\ref{Lazy-DP-TS regret theorem}); (4) We show through numerical experiments that our proposed learning algorithms improve on the two existing optimal algorithms, DP-SE and Anytime-Lazy-UCB. 
 


%\section{Learning Problems} \label{sec: learning problem}
\section{Problem definition and background} \label{sec: learning problem}

\subsection{Stochastic Multi-armed Bandits}
%In a stochastic multi-armed bandit problem, we have 
We consider the stochastic multi-armed bandit setting, with a fixed set $\mathcal{A}$ of $K$ arms and a stochastic environment.  At each round %the beginning of round 
$t = 1,2, \dotsc, T$, the environment generates a reward vector $X_t := \left(X_1(t), X_2(t), \dotsc, X_K(t) \right)$ with each $X_j(t) \in \{0,1\}$ independently drawn from a Bernoulli distribution\footnote{As  shown in~\citep{agrawal2017near}, the Bernoulli reward setting can be generalized to any bounded reward setting.} with parameter $\mu_j \in (0,1)$. %Simultaneously, Learner 
The learning algorithm pulls an arm $J_t \in \mathcal{A}$ and at the end of round $t$, observes and obtains the reward of the pulled arm, $X_{J_t}(t)$.  %(Under these conditions, will use the terms observation and reward interchangeably throughout the paper.) 
The goal of the algorithm is to select an arm in each round such that the accumulated reward over $T$ rounds is maximized.

Without loss of generality, we assume that the optimal arm is unique and let arm $1$ be the unique optimal arm, i.e., $\mu_1 > \mu_j$ for all $j \in \mathcal{A} \backslash \{1\}$. Let $\Delta_j := \mu_1 - \mu_j$ be the mean reward gap, which indicates the performance loss in a single round when a sub-optimal arm $j$ is pulled instead of the best arm $1$.
We use (pseudo)-regret $\mathcal{R}(T)$ to measure the performance, %of our developed learning algorithms, which can be 
expressed as
\begin{equation*}
\begin{array}{lll}
    \mathcal{R}(T) & = &T \cdot \mu_1 - \sum\limits_{j \in \mathcal{A}} \mathbb{E} \left[\sum\limits_{t=1}^{T} \bm{1} \left\{J_t = j \right\}  \right] \cdot \mu_j  \\& =  & \sum\limits_{j \in \mathcal{A}: \Delta_j >0} \mathbb{E} \left[\sum\limits_{t=1}^{T} \bm{1} \left\{J_t = j \right\}  \right] \cdot \Delta_j  
    \quad.
    \end{array}
\end{equation*}
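The second equality uses only the fact that exactly one arm is pulled in each round. As a short derivation, write $N_j(T) := \sum_{t=1}^{T} \bm{1} \left\{J_t = j \right\}$ (a shorthand introduced here only for illustration). Since $\sum_{j \in \mathcal{A}} N_j(T) = T$, we have
\begin{equation*}
\begin{array}{lll}
    T \cdot \mu_1 - \sum\limits_{j \in \mathcal{A}} \mathbb{E} \left[N_j(T) \right] \cdot \mu_j & = & \sum\limits_{j \in \mathcal{A}} \mathbb{E} \left[N_j(T) \right] \cdot \left(\mu_1 - \mu_j \right) \\
    & = & \sum\limits_{j \in \mathcal{A}: \Delta_j >0} \mathbb{E} \left[N_j(T) \right] \cdot \Delta_j \quad,
\end{array}
\end{equation*}
where arms with $\Delta_j = 0$ contribute nothing to the sum. For instance, with two arms and the illustrative values $\mu_1 = 0.9$ and $\mu_2 = 0.6$, pulling arm $2$ an expected $50$ times yields $\mathcal{R}(T) = 50 \cdot 0.3 = 15$.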
\subsection{Differential Privacy}\label{dp def}
Differential privacy is a widely accepted framework of privacy and is based on the notion of plausible deniability:  an adversary should learn nearly the same thing if one element in the dataset is changed or missing.  In the context of bandits,  a dataset is the stream of reward vectors drawn throughout the algorithm, and a change would refer to one reward vector in the stream.   More formally, let $\bm{X}_{1:t}$ be the sequence of reward vectors up to time $t$ and let $\bm{X}'_{1:t}$ be a neighbouring sequence which differs in at most one reward vector, say, at any round $s, s \le t$.  The output of a bandit learning algorithm is the sequence of arm selections at each round.  In this context, differential privacy is defined as follows, omitting the subscript $t$ for clarity. % for each time step $t$.


\begin{definition}\label{dp def new}
An online learning algorithm $\mathcal{M}$ is $\epsilon$-differentially private if, at every round $t=1, \dotsc, T$, for any two neighbouring reward sequences $\bm{X}$ and  $\bm{X}'$, and for any set $\mathcal{D}$ of decisions made, it holds that
$\mathbb{P} \left\{\mathcal{M}(\bm{X}) \in \mathcal{D} \right\} \le e^{\epsilon} \cdot \mathbb{P} \left\{\mathcal{M}(\bm{X}') \in \mathcal{D} \right\}$.
\end{definition}


\paragraph{Remark.} Our definition of differential privacy follows the standard notion introduced in \citep{dwork2014algorithmic}. It
 is closely related to the Max  Divergence $D_{\infty}(Q, Q') := \mathop{\max}\limits_{y \in \text{Support}(Q')} \ln \frac{\mathbb{P} (Q=y)}{\mathbb{P}(Q'=y)} $ between two probability distributions $Q$ and $Q'$.   If we view $Q$ as the output distribution (the distribution of the sequentially pulled arms) when working over the true reward sequences $\bm{X}$ and view $Q'$  as the output distribution when working over  $\bm{X}'$, an $\epsilon$-differentially private algorithm guarantees that the Max Divergence between $Q$ and $Q'$ is at most $\epsilon$, i.e., $\ln \left( \frac{\mathbb{P} \left\{Q=y\right\}}{\mathbb{P} \left\{Q'=y \right\}} \right) \le \epsilon$ for all possible outputs $y$.
 The quantity $\ln \left( \frac{\mathbb{P} \left\{Q=y\right\}}{\mathbb{P} \left\{Q'=y \right\}} \right)$ is the privacy loss incurred when an adversary witnesses the outcome $y$.
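
To make the privacy loss concrete, consider the following illustration, which is not part of the formal development; the release rule and the symbol $\epsilon'$ are introduced only for this example. Suppose a single reward $x \in \{0,1\}$ is released as $y = x + Z$, where $Z$ is Laplace noise centered at $0$ with scale $1/\epsilon'$, i.e., with density $p_Z(z) = \frac{\epsilon'}{2} e^{-\epsilon' |z|}$. For a neighbouring reward $x' \in \{0,1\}$, the privacy loss (in terms of densities) at any outcome $y$ is
\begin{equation*}
\ln \left( \frac{p_Z(y - x)}{p_Z(y - x')} \right) = \epsilon' \cdot \left( |y - x'| - |y - x| \right) \le \epsilon' \cdot |x - x'| \le \epsilon' \quad,
\end{equation*}
where the first inequality is the triangle inequality. Hence this single release is $\epsilon'$-differentially private.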
    






%\section{Related Work} \label{sec: literature}
\subsection{Related Work} \label{sec: literature}

%Regarding the learning algorithms for  
In the classical stochastic bandit setting, the UCB-based, Thompson Sampling-based, and elimination-based algorithms all achieve good theoretical guarantees. 
Essentially, these algorithms rely on  the empirical means to make decisions and the regret bounds  take the $O(K\log(T)/\Delta)$ form. %\citep{auer2002finite,auer2010ucb} or the $O \left(\frac{K\log(T) \Delta}{d_{\text{KL}}(\mu^*-\Delta, \mu^*)} \right)$ \citep{garivier2011kl,agrawal2017near} form, where $d_{\text{KL}}(a,b)$ is the KL-divergence between two Bernoulli distributions with parameters $a,b$ and $\mu^*$ is the mean reward of the best arm.

As shown in Proposition~2.1 of~\citep{dwork2014algorithmic},  differential privacy is invariant to post-processing, %has a nice property called \emph{immunity to post-processing}, 
i.e., if a learning algorithm takes the output of an $\epsilon$-differentially private  algorithm as input, then the output of this learning algorithm itself is also $\epsilon$-differentially private. In designing stochastic bandit algorithms with differential privacy, if the internal algorithm to compute the empirical mean is designed to be $\epsilon$-differentially private, then following from the post-processing property, we can claim the bandit  algorithm itself is  $\epsilon$-differentially private. %Note that constructing upper confidence bounds, eliminating arms, or generating posterior samples can all be viewed as post-processing.
This property has indeed been used in the design of private algorithms in previous work% Keeping the aforementioned idea in mind,    the learning algorithms for differentially private stochastic bandits have been presented in 
~\citep{mishra2015nearly,tossou2016algorithms,sajed2019optimal,hu2021optimal}. 




\citet{mishra2015nearly} present the first differentially private versions of the UCB and Thompson Sampling-based algorithms.  However, the regret bounds they derive, $O \left( \frac{K\log^3(T)}{\epsilon \cdot \Delta} \right)$ and $O \left( \frac{K\log^3(T)}{\epsilon^2 \cdot \Delta^2} \right)$, are far from the $\Omega \left(\frac{K\log(T)}{\Delta} + \frac{K\log(T)}{\epsilon} \right)$ regret lower bound that is derived later by \citet{shariff2018differentially}.  The key reason for the sub-optimality is using the $T$-bounded Binary Mechanism\footnote{\citet{dwork2010differential} call it the Tree-based Mechanism, but the core idea is identical.} \citep{dwork2010differential,chan2011private} to add  random noise to mask the empirical mean  for an arm. Furthermore, since  their algorithms need to know the time horizon $T$ in advance to calibrate the distribution of the random noise, they cannot be anytime learning algorithms.   %{\color{red}their learning algorithms cannot be considered private at any time $t < T$. }
More importantly,  their Thompson Sampling-based  algorithm has some operational issues: in some rounds, %their algorithm cannot work as
the total reward $r_a(t)$ %% why is this subscripted with a - accumulated? 
computed by the tree-based mechanism can take negative values, resulting in invalid  parameters for the posterior distribution, the Beta distribution.  Note that the parameters of Beta distributions must be non-negative.  Our proposed learning algorithms carefully use clipping to address this issue.%address this issue by using clipping.

Recently, two optimal algorithms have been proposed for differentially private stochastic bandits.   \citet{sajed2019optimal} propose DP-SE, an optimal elimination-style algorithm,  and   \citet{hu2021optimal} propose Anytime-Lazy-UCB, an optimal UCB-based algorithm.  The key idea in achieving optimality is to use fresh observations to compute the differentially private empirical means, thus minimizing the number of noise variables needed.  %Then, the number of noise variables required is minimized. 
Although DP-SE, Anytime-Lazy-UCB, and our proposed Lazy-DP-TS are all optimal, 
as will be shown in Section~\ref{sec: experiments}, Lazy-DP-TS always outperforms the other two algorithms.


 %\vspace{-1\baselineskip}






%\section{DP-TS and Lazy-DP-TS} \label{sec: dp-ts and lazy-dp-ts}
\section{Algorithms and Analysis} \label{sec: dp-ts and lazy-dp-ts}

We now present our algorithms for achieving differential privacy in the stochastic bandit setting.  %  algorithms: DP-TS and Lazy-DP-TS.
The algorithms rely on two key ideas.  %There are two key ideas supporting  our proposed  algorithms. 
The first  is the invariance of differential privacy to post-processing:
because the internal algorithm that computes the empirical means is differentially private, the arm selection algorithm built on top of it is differentially private as well.  
The second  is 
 based on the  principle of optimism in the face of uncertainty. 
Note that the decisions for the Thompson Sampling-based algorithms fully depend on the generated random  samples from the posterior distributions. %In the design of differentially private Thompson Sampling-based algorithms, 
 Operating under the optimism principle, we reshape the posterior distribution in an optimistic way:  we shift the posterior distribution in the private algorithm towards the right as compared to the posterior distribution for the non-private Thompson Sampling.  This shifting makes it %The purpose of the shifting is, for each arm, it is 
 more likely to draw a ``good'' posterior sample as compared to the draw in the non-private setting. 


While both our algorithms rely on these fundamental concepts, the key difference between them 
%The key difference between DP-TS and Lazy-DP-TS  
lies in the design of the internal algorithm  to compute the differentially private empirical means.  
Our first algorithm, Differentially Private Thompson Sampling (DP-TS), uses all the observations %rewards
from the beginning in computing the differentially private empirical means, whereas our second algorithm, Lazy Differentially Private Thompson Sampling (Lazy-DP-TS) uses only a subsequence of observations.

Suppose at a given time step, arm $j$ has a sequence of  $n$ observations $(x_1, x_2, \dotsc, x_n)$.
 In DP-TS, all $n$ observations will be used to compute the differentially private empirical mean for  arm $j$. 
We partition $(x_1, x_2, \dotsc, x_n)$ into $(x_1, x_2, \dotsc, x_m)$ and $(x_{m+1}, x_{m+2}, \dotsc, x_n)$, where $m = 2^{\lfloor \log(n+1) \rfloor}-1$. This partition guarantees $n-m \le m$, i.e., the length of the first subsequence is always no smaller than the length of the second subsequence.
The internal algorithm  
composes two differentially private  mechanisms, each being $0.5\epsilon$-differentially private and acting on each partition, respectively, to process these $n$ observations. 
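
As a small worked example of this partition (the value $n = 10$ is chosen purely for illustration, and $\log$ is taken to base $2$), we get
\begin{equation*}
m = 2^{\lfloor \log(11) \rfloor} - 1 = 2^{3} - 1 = 7, \qquad n - m = 3 \le 7 = m \quad,
\end{equation*}
so the first mechanism handles $(x_1, \dotsc, x_7)$ and the second handles $(x_8, x_9, x_{10})$. In general, $n + 1 < 2^{\lfloor \log(n+1) \rfloor + 1} = 2(m+1)$ gives $n \le 2m$, which is the inequality $n - m \le m$.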

The first mechanism is a modified version of the Logarithmic Mechanism \citep{chan2011private} and works over $(x_1, x_2, \dotsc, x_m)$:  %random noise will be added  and  
a differentially private aggregated reward of these $m$ observations will be computed. %The key  difference between the original Logarithmic Mechanism  and our presented modified version lies in when to inject noise. 
According to the original mechanism by \citet{chan2011private}, random noise would be added to the reward of an arm whenever the number of observations of that arm hits $2^r$, for all $r\ge 0$, while in our modified version, random noise is added whenever the number of observations hits $2^{r+1} - 1$, so that fresh noise is added at longer epochs, resulting in less overall noise. The second mechanism is the bounded Binary Mechanism  \citep{chan2011private} and works over $(x_{m+1}, x_{m+2}, \dotsc, x_n)$:  random noise will be added based on the bounded Binary Mechanism and  a differentially private aggregated reward of these $n-m$ observations will be output. The differentially private empirical mean is thus computed by aggregating the outputs of these two mechanisms.  

Note that a given observation may be used more than once in the calculation of the empirical means over rounds, which means more noise is required to maintain the same degree of privacy.
Based on this observation, we propose Lazy-DP-TS, where
the internal algorithm only uses a subsequence of all the observations obtained so far to compute the differentially private empirical mean and no observation can be reused, i.e., once an observation has been used, it is discarded. The length of the subsequences doubles each time, i.e., the internal algorithm adds one random noise variable per batch of $2^r$ observations, $r \ge 0$, and outputs a differentially private empirical mean.  This restriction of using an observation only once in the calculation of the empirical mean minimizes added noise and is thus the key to the optimality of differentially private online learning algorithms.  
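
To illustrate why using each observation only once helps, consider a single batch of $2^r$ fresh observations $x_1, \dotsc, x_{2^r} \in \{0,1\}$ and, as an assumption made only for this sketch, a single noise variable $Z$ drawn from a Laplace distribution centered at $0$ with scale $1/\epsilon$ (the exact noise calibration of Lazy-DP-TS is the one given in its algorithm description). Changing one observation changes the batch sum by at most $1$, and no other released quantity depends on this batch. The noisy batch mean and its expected noise magnitude are then
\begin{equation*}
\frac{1}{2^r} \left( \sum\limits_{i=1}^{2^r} x_i + Z \right), \qquad \mathbb{E} \left[ \frac{|Z|}{2^r} \right] = \frac{1}{\epsilon \cdot 2^r} \quad,
\end{equation*}
so under this assumption the expected noise magnitude halves every time the batch length doubles.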



 %\vspace{-1\baselineskip}


\paragraph{Notation.} Let $\text{Beta}(\alpha, \beta)$ be a Beta distribution with parameters $\alpha, \beta$  and $\text{Lap}(b)$ be a Laplace distribution centered at $0$ with scale $b$.
The pdfs of $\text{Beta}(\alpha, \beta)$ and $\text{Lap}(b)$ are shown in the Appendix. Also,
$\log(x)$ is the base-$2$ logarithm
of $x$ and $\ln(x)$ is the base-$e$ logarithm
of $x$. 


\subsection{DP-TS} \label{sec: dp-ts}
We now present DP-TS, followed by its guarantees.

% \vspace{-0.5\baselineskip}
 
\subsubsection{Algorithm}

\begin{algorithm}[!ht]
	\caption{DP-TS}
	\label{Private TS2}
	\begin{algorithmic}[1]
	\STATE {\bf{Input}:} Arm set $\mathcal{A}$ and privacy parameter $\epsilon$
	
%	\
	
%	 $\%$ Initialization phase starts
	\FOR {$t = 1,2, \dotsc, K$} \label{ini-start}
\STATE Pull $J_t \leftarrow t$; Set $O_{J_t} \leftarrow 1$, $\Psi_{J_t} \leftarrow \{\}$, $C_{J_t} \leftarrow X_{J_t}(t) + \text{Lap} \left(\frac{1}{0.5\epsilon} \right) $, $r_{J_t} \leftarrow 0$, $B_{J_t} \leftarrow 0$, $\widetilde{\mu}_{J_t, O_{J_t}} \leftarrow \frac{C_{J_t} + B_{J_t}}{O_{J_t}}$ 
\ENDFOR \label{ini-end}

 %$\%$ Initialization phase ends
 
% \

	\FOR {$t = K+1, K+2, \dotsc$}
	

	 
\FOR {$j \in \mathcal{A}$} \label{post-pro 1}

\STATE Set $\overline{\mu}_{j, O_j}$  

$=
   \max \left\{0, \min \left\{\widetilde{\mu}_{j,O_j} + \frac{6\sqrt{8} \log(O_j+1) \log(t)}{\epsilon \cdot O_j},1 \right\} \right\}$ 


\STATE Set $\widetilde{\alpha}_j \leftarrow \overline{\mu}_{j, O_j} \cdot O_j$, $\widetilde{\beta}_j \leftarrow (1-\overline{\mu}_{j, O_j} )\cdot O_j$  
\STATE Sample $\theta_j(t) \sim \text{Beta}(\widetilde{\alpha}_j + 1, \widetilde{\beta}_j + 1)$ 
\ENDFOR
\STATE Pull arm $J_t \in \mathop{\arg\max}_{j \in \mathcal{A}} \theta_j(t)$ \label{post-pro 2}

%\

%	 $\%$ Process observations starts
	 
	 
	 
\STATE Set $O_{J_t} \leftarrow O_{J_t} + 1$; 
Append $X_{J_t}(t)$   to  $\Psi_{J_t}$ \label{internal 1}


\IF {$O_{J_t} = \sum\limits_{s = 0}^{r_{J_t}+1} 2^s$}
\STATE Set $C_{J_t} \leftarrow C_{J_t} + \sum\Psi_{J_t} +\text{Lap} \left(\frac{1}{0.5\epsilon} \right)$ \label{C_j}
\STATE Set $\Psi_{J_t} \leftarrow  \{\}$, $r_{J_t} \leftarrow r_{J_t} + 1$, $B_{J_t} \leftarrow 0$
\ELSE
\STATE Invoke $2^{r_{J_t}+1}$-bounded Binary Mechanism  with Input $\left(0.5\epsilon,\Psi_{J_t} \right)$ and Output $B_{J_t}$ \label{B_j}
\ENDIF


\STATE Set $\widetilde{\mu}_{J_t, O_{J_t}} \leftarrow \frac{C_{J_t} + B_{J_t}}{O_{J_t}}$\quad.
\label{internal 2}

 %$\%$ Process observations ends

%\ 

\ENDFOR	
\end{algorithmic}
\end{algorithm}
% \vspace{-1\baselineskip}
 
We first present some notation specific to this algorithm. 
  $O_j(t-1) :=\sum_{s=1}^{t-1} \bm{1} \left\{J_s = j\right\}$ counts the number of pulls of arm $j$ by the end of round $t-1$ and $\widehat{\mu}_{j,O_j(t-1)} $ is the empirical mean  over these $O_j(t-1)$ observations. Let $\widetilde{\mu}_{j, O_j(t-1)}$ be the  private empirical mean, i.e.,  $\widehat{\mu}_{j,O_j(t-1)} $ plus some  noise. % %% this statement is not needed; we know by now the role of the added noise. The added noise is to mask $\widehat{\mu}_{j,O_j(t-1)} $. %At the end of this section, we  summarize the distributions of the noise variables.

DP-TS is presented in Algorithm~\ref{Private TS2}.
Lines~\ref{ini-start} to \ref{ini-end}  initialize the algorithm. We pull each arm once and set $\Psi_j = \{\}$ to hold future observations.
Let $C_j$  track the differentially private aggregated reward computed by  the modified Logarithmic Mechanism and $B_j$  track the  private aggregated reward returned by the  Binary Mechanism. 
Since for each arm the modified Logarithmic Mechanism processes observations in epochs, we use $r_j$ to index the arm-specific epoch, i.e., the modified Logarithmic Mechanism will  add a noise  variable  to mask the aggregated reward of $2^{r_j}$ observations at the end of epoch $r_j$. 
We initialize $r_j=0$ and the initialization phase adds random noise to the first observation.


 
Let $\upsilon_{\epsilon, O_j(t-1), t} := \frac{6\sqrt{8}\log(O_j(t-1)+1)\log(t)}{\epsilon \cdot O_j(t-1)}$. 
For all the rounds $t \ge K +1$, we first compute 
$
 \overline{\mu}_{j, O_j(t-1)} =
 \max \left\{0, \min \left\{\widetilde{\mu}_{j,O_j(t-1)} + \upsilon_{\epsilon, O_j(t-1), t},1 \right\} \right\}$. Note that the empirical means are clipped so that $\overline{\mu}_{j,O_j(t-1)} \in [0,1]$. 
 We set $\widetilde{\alpha}_j(t) := \overline{\mu}_{j, O_j(t-1)} \cdot O_j(t-1)$ and $\widetilde{\beta}_j(t) := \left(1-\overline{\mu}_{j, O_j(t-1)} \right) \cdot O_j(t-1)$. 
We then generate a random posterior sample $\theta_j(t) \sim \text{Beta} \left(\widetilde{\alpha}_j(t)+1, \widetilde{\beta}_j(t)+1 \right)$ for  each arm and   pull the arm with the highest sample, i.e., $J_t \in \mathop{\arg\max}_{j \in \mathcal{A}}\theta_j(t)$. Since  $\overline{\mu}_{j,O_j(t-1)} \in [0,1]$, the parameters of the Beta distribution are valid.
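
For a sense of scale of the optimistic shift, consider the following values, which are illustrative only and not taken from the experiments; $\log$ is taken to base $2$. Let $\epsilon = 1$. With $O_j(t-1) = 7$ and $t = 64$,
\begin{equation*}
\upsilon_{\epsilon, O_j(t-1), t} = \frac{6\sqrt{8} \cdot \log(8) \cdot \log(64)}{1 \cdot 7} = \frac{6\sqrt{8} \cdot 3 \cdot 6}{7} \approx 43.6 \quad,
\end{equation*}
so unless the injected noise is extremely negative, $\overline{\mu}_{j, O_j(t-1)}$ is clipped to $1$, in which case $\theta_j(t) \sim \text{Beta}(8, 1)$. With $O_j(t-1) = 2^{16} - 1$ and $t = 2^{20}$,
\begin{equation*}
\upsilon_{\epsilon, O_j(t-1), t} = \frac{6\sqrt{8} \cdot 16 \cdot 20}{2^{16} - 1} \approx 0.083 \quad,
\end{equation*}
so the shift becomes small once an arm has been pulled many times.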



To  update the  private empirical mean of the pulled arm, we append $X_{J_t}(t)$ to  $\Psi_{J_t}$. If the number of observations in $\Psi_{J_t}$ hits $2^{r_{J_t}+1}$,  we add random noise drawn from $\text{Lap} \left(\frac{1}{0.5\epsilon}\right)$ and update $C_{J_t}$. Since now all observations in $\Psi_{J_t}$ are used by the modified Logarithmic Mechanism, we reset $\Psi_{J_t}$ and $B_{J_t}$, and increment $r_{J_t}$ by one. If the number of observations in $\Psi_{J_t}$ has not reached $2^{r_{J_t}+1}$, we invoke the $2^{r_{J_t}+1}$-bounded Binary Mechanism \citep{chan2011private} taking  $\Psi_{J_t}$ as input and preserving $0.5\epsilon$-differential privacy. Note the number of observations in  $\Psi_{J_t}$ is at most $2^{r_{J_t}+1}$.

{\bf{Remark}.} %Two remarks are in order for Algorithm~\ref{Private TS2}. 
(a) $r_j$ is  determined by $O_j(t-1)$ as $r_j$ will only increment by one whenever the number of observations in $\Psi_j$ hits $2^{r_j+1}$. Indeed, $r_j = \left\lfloor\log(O_j(t-1) + 1) \right\rfloor-1$. (b) Regarding the noise variables included in the  differentially private empirical mean, there are exactly $r_j+1$ i.i.d. random variables that are
drawn from $\text{Lap} \left( \frac{1}{0.5\epsilon} \right)$ and  at most $r_j+1$ i.i.d. random variables that are drawn from $\text{Lap} \left( \frac{r_j+1}{0.5\epsilon} \right)$. 
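
As a worked instance of this remark (the value $O_j(t-1) = 10$ is illustrative, and $\log$ is taken to base $2$), we have $r_j = \lfloor \log(11) \rfloor - 1 = 2$. The aggregate $C_j$ then covers the first $7$ observations and contains $r_j + 1 = 3$ noise variables drawn from $\text{Lap} \left( \frac{2}{\epsilon} \right)$, added when the number of observations reached $1$, $3$, and $7$. The remaining $3$ observations sit in $\Psi_j$ and contribute, through $B_j$, at most $3$ noise variables drawn from $\text{Lap} \left( \frac{6}{\epsilon} \right)$. Using the fact that a $\text{Lap}(b)$ random variable has variance $2b^2$, and assuming that all noise variables are mutually independent, the variance of the noise in $\widetilde{\mu}_{j, 10}$ is at most
\begin{equation*}
\frac{1}{10^2} \left( 3 \cdot 2 \cdot \left(\frac{2}{\epsilon}\right)^2 + 3 \cdot 2 \cdot \left(\frac{6}{\epsilon}\right)^2 \right) = \frac{24 + 216}{100 \cdot \epsilon^2} = \frac{2.4}{\epsilon^2} \quad.
\end{equation*}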

We now compare Algorithm~\ref{Private TS2} to the  non-private Thompson Sampling by \citet{agrawal2017near}.
Let $\alpha'_j(t) := \widehat{\mu}_{j, O_j(t-1)} \cdot O_j(t-1)$ be the number of  successes and  $\beta'_j(t) := \left(1-\widehat{\mu}_{j, O_j(t-1)} \right) \cdot O_j(t-1)$ be the number of failures among $O_j(t-1)$ Bernoulli trials. Recall that in the non-private Thompson Sampling, we draw $\theta'_j(t) \sim \text{Beta}\left(\alpha'_j(t)+1, \beta'_j(t)+1 \right)$. By adding $\upsilon_{\epsilon, O_j(t-1), t}$ to $\widetilde{\mu}_{j, O_j(t-1)}$, we have,  with high probability,  $\overline{\mu}_{j,O_j(t-1)} \ge \widehat{\mu}_{j, O_j(t-1)}$, i.e., the posterior distribution for the differentially private version is shifted towards the right as compared to the non-private version.  





 %\vspace{-0.5\baselineskip}


\subsubsection{Analysis}
We present privacy and regret guarantees for Algorithm~\ref{Private TS2}.
\begin{theorem}
Algorithm~\ref{Private TS2} is $\epsilon$-differentially private.
\end{theorem}
\begin{proof}
We first show the internal algorithm  to compute the empirical mean, i.e., from  Lines~\ref{internal 1} to \ref{internal 2}, is $\epsilon$-differentially private. Then, from Proposition~2.1 of \cite{dwork2014algorithmic}, we conclude that Algorithm~\ref{Private TS2} is $\epsilon$-differentially private. Note that Lines~\ref{post-pro 1} to \ref{post-pro 2} can be viewed as post-processing since in these steps, the learning algorithm does not  touch any revealed observations.
Suppose reward sequences $\bm{X}$ and $\bm{X}'$ differ in round $h$, i.e., the reward vectors $X_h = \left(X_1(h), \dotsc, X_K(h) \right)$ and $X'_h = \left(X'_1(h), \dotsc, X'_K(h) \right)$  are not the same. 
Note that changing from $X_h$ to $X'_h$ has no impact on other arms except arm $J_h$ as only the reward of the pulled arm, $J_h$, is revealed in round $h$. Let $J_h = j$.  
At the end of round $h$, the differentially private empirical mean of arm $j$ will be updated.
According to Algorithm~\ref{Private TS2},  changing from $X_j(h)$ to $X'_j(h)$ impacts $C_j$ by at most $1$. From Theorem~3.6 of \cite{dwork2014algorithmic}, we know the internal algorithm to compute $C_j$ (Line~\ref{C_j}) is $0.5\epsilon$-differentially private. From Theorem~3.5 of \cite{chan2011private}, we know the internal algorithm to compute $B_j$ (Line~\ref{B_j}) is $0.5\epsilon$-differentially private. Composing these two internal algorithms together, from Theorem~3.14 in \citep{dwork2014algorithmic}, we   conclude that the internal algorithm (Line~\ref{internal 2}) to compute the differentially private empirical mean  is $\epsilon$-differentially private. \end{proof}
\begin{theorem}
The regret $\mathcal{R}_{\text{DP-TS}}(T) $ of Algorithm~\ref{Private TS2}  is at most

$  \sum\limits_{j \in \mathcal{A}: \Delta_j > 0}
    O \left( \max \left\{ \frac{\log(T)}{\Delta_j},  \frac{\log(T)}{\epsilon} \log \left(\frac{\log(T)}{\epsilon \cdot \Delta_j} \right)\right\} \right) $\quad.
\label{DP-TS regret theorem}
\end{theorem}



{\bf{Remark}.} Several remarks are in order. (a): DP-TS is optimal up to a $\log\log (T)$ factor. (b): As $\epsilon\rightarrow\infty$, Algorithm~\ref{Private TS2} reduces to the algorithm of \citet{agrawal2017near}. However, our regret bound in Theorem~\ref{DP-TS regret theorem} is only order-optimal rather than asymptotically optimal, whereas the regret bound of the non-private Thompson Sampling can be asymptotically optimal, i.e., it attains the best possible coefficient for the leading term asymptotically. (c): Algorithm~\ref{Private TS2} also has an $O \left( \sqrt{KT\log (T)} + \frac{K\log(T)}{\epsilon} \log \left( \frac{\sqrt{T \log(T)}}{\sqrt{K} \epsilon}  \right)\right)$ problem-independent regret bound. It is known that Thompson Sampling is able to achieve the $\Omega \left(\sqrt{KT}\right)$ minimax lower bound for non-private stochastic bandits \citep{jin2021mots}. Therefore, the $O \left(\sqrt{KT \log(T)} \right)$ term in this problem-independent bound is a $\sqrt{\log(T)}$ factor away from being minimax optimal.
The price of introducing differential privacy is
$\Omega \left(  \frac{K\log(T)}{\epsilon}  \right)$ \citep{shariff2018differentially}. This lower bound implies that
DP-TS is a $\log \left( \frac{\sqrt{T \log(T)}}{\sqrt{K} \epsilon}  \right)$ factor away from being optimal in the private setting.
The detailed proof for the problem-independent result is deferred to Appendix.

We now provide a proof sketch for Theorem~\ref{DP-TS regret theorem}. The detailed proof is deferred to Appendix. Let $\mathcal{F}_{t-1}$ collect all the history information containing the pulled arms, the rewards associated with the pulled arms, and the added noise. Define $\mathcal{F}_{0} = \{\}$.
Let $y_j := \mu_1 - \frac{\Delta_j}{6}$ and define
 $E_j^{\theta}(t)$ as the event  that $\left\{\theta_j(t) \le y_j \right\}$. 
 Let $C_j(t-1)$ be the event  that  $\left\{\left|\mu_j - \widehat{\mu}_{j, O_j(t-1)}  \right| \le \sqrt{\frac{3\log(t)}{O_j(t-1)}} \right\}$. Let $G_j(t-1)$ be the event  that $\left\{\left|\widehat{\mu}_{j,O_j(t-1)} - \widetilde{\mu}_{j, O_j(t-1)} \right| \le \upsilon_{\epsilon, O_j(t-1), t} \right\}$. 

\begin{proof}[Proof sketch of Theorem~\ref{DP-TS regret theorem}]  
 We upper bound $\mathbb{E}[O_j(T)]$. 
 Let
$\mathcal{L}_j :=  \max \left\{ \frac{108\log(T)}{\Delta_j^2},  \frac{72\log(T)}{\epsilon \cdot \Delta_j} \log \left(\frac{72\log(T)}{\epsilon \cdot \Delta_j} \right)\right\} $. 
We separate all $T$ rounds into two regimes based on whether $O_j(t-1) \ge \mathcal{L}_j$. For all rounds $t$ such that $O_j(t-1) < \mathcal{L}_j $, the total regret is at most $\mathcal{L}_j \cdot \Delta_j$. In a round when $O_j(t-1) \ge \mathcal{L}_j $, with high probability (w.h.p.), we have
$\overline{\mu}_{j,O_j(t-1)} \le \mu_j + \frac{4\Delta_j}{6}$, which implies that $\overline{E_j^{\theta}(t)}$ is a low-probability event. Meanwhile, w.h.p., we also have $\overline{\mu}_{1, O_1(t-1)} \ge \widehat{\mu}_{1, O_1(t-1)}$, which allows us to reduce the proof to the non-private setting.

 With these ideas in hand, we have $\sum_{t=1}^{T} \mathbb{E} \left[\bm{1} \left\{J_t = j \right\} \right]$
 \begin{equation}
    \begin{array}{ll}
   %\le &\mathcal{L}_j +\sum\limits_{t = 1}^{T} \mathbb{P}  \left\{J_t = j, O_j(t-1) >\mathcal{L}_j \right\} \\
  \le &  \mathcal{L}_j + \underbrace{ \sum_{t = 1}^{T} \mathbb{P}   \left\{ \overline{C_j(t-1)} \right\} +\sum\limits_{t = 1}^{T}  \mathbb{P} \left\{ \overline{G_j(t-1)} \right\}}_{=:\omega_0} \\
    + & \underbrace{\sum\limits_{t = 1}^{T}\mathbb{P}  \left\{O_j(t-1) >\mathcal{L}_j, C_j(t-1), G_j(t-1), \overline{E_j^{\theta}(t)} \right\}}_{=:\omega_1} \\
  + & \underbrace{\sum\limits_{t = 1}^{T} \mathbb{P}  \left\{J_t = j,   E_j^{\theta}(t) \right\}}_{=:\omega_2}
     \quad.
    \end{array}
    \label{cloud 2}
\end{equation}

Via  well-known concentration inequalities, we have $\omega_0 \le O(1)$ (lemmas are shown in Appendix). For $\omega_1$, we use the  argument that if events $C_j(t-1)$ and $G_j(t-1)$ are  true simultaneously and arm $j$ has been pulled at least $\mathcal{L}_j$ times, we have $\overline{\mu}_{j, O_j(t-1)} \le \mu_j + \frac{4\Delta_j}{6}$.  Since  $\theta_j(t) \sim \text{Beta} \left(\widetilde{\alpha}_j(t)+1, \widetilde{\beta}_j(t)+1 \right)$, from the properties of the Beta distribution, we know that it is very unlikely to draw  $\theta_j(t) > \mu_j + \frac{5\Delta_j}{6}$.   In Appendix, we  show that $\omega_1 \le O(1)$.
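
To illustrate where the constants in $\mathcal{L}_j$ come from, the first term can be checked directly. This is only a sketch: we assume that $\overline{\mu}_{j, O_j(t-1)}$ is $\widetilde{\mu}_{j, O_j(t-1)} + \upsilon_{\epsilon, O_j(t-1), t}$ clipped to $[0,1]$, and the bound on $\upsilon_{\epsilon, O_j(t-1), t}$ used below is our reading of what the second term of $\mathcal{L}_j$ ensures; the full argument is in Appendix. If $O_j(t-1) \ge \frac{108\log(T)}{\Delta_j^2}$ and $t \le T$, then on $C_j(t-1)$ we have
\begin{equation*}
\left|\mu_j - \widehat{\mu}_{j, O_j(t-1)}\right| \le \sqrt{\frac{3\log(t)}{O_j(t-1)}} \le \sqrt{\frac{3\Delta_j^2}{108}} = \frac{\Delta_j}{6} .
\end{equation*}
On $G_j(t-1)$, we have $\widetilde{\mu}_{j, O_j(t-1)} + \upsilon_{\epsilon, O_j(t-1), t} \le \widehat{\mu}_{j, O_j(t-1)} + 2\upsilon_{\epsilon, O_j(t-1), t}$. Hence, if in addition $\upsilon_{\epsilon, O_j(t-1), t} \le \frac{\Delta_j}{4}$, then
\begin{equation*}
\overline{\mu}_{j, O_j(t-1)} \le \mu_j + \frac{\Delta_j}{6} + \frac{2\Delta_j}{4} = \mu_j + \frac{4\Delta_j}{6} ,
\end{equation*}
where the clipping does not affect the upper bound since $\mu_j \ge 0$.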

The key challenge is to upper bound $\omega_2$. We first  reduce the proof to the non-private Thompson Sampling. Then, we reuse 
Lemmas~2.9 and 2.10 in \citep{agrawal2017near} to conclude the proof. 
Now, we show how to reduce the proof to the non-private setting. 
By introducing $G_1(t-1)$ and $\overline{G_1(t-1)}$, term $\omega_2$ is at most

$ \sum\limits_{t = 1}^{T} \mathbb{P}  \left\{J_t = j,    G_1(t-1), E_j^{\theta}(t) \right\} + \sum\limits_{t = 1}^{T} \mathbb{P}  \left\{\overline{ G_1(t-1)} \right\}$.
For the second term above, it is at most $O (1)$ (shown in Appendix).
For the first term above, we have
\begin{equation}
    \begin{array}{ll}
         & \sum\limits_{t = 1}^{T} \mathbb{P}  \left\{J_t = j,    G_1(t-1), E_j^{\theta}(t) \right\} \\
       \le  &  \mathbb{E} \left[\sum\limits_{t = 1}^{T} \frac{\mathbb{P} \left\{\theta_1(t) \le y_j \mid \mathcal{F}_{t-1}\right\}}{1-\mathbb{P} \left\{\theta_1(t) \le y_j \mid \mathcal{F}_{t-1} \right\}} \bm{1} \left\{J_t = 1,    G_1(t-1) \right\} \right]\\
       \le  &  \mathbb{E} \left[\sum\limits_{t = 1}^{T} \frac{\mathbb{P} \left\{\theta'_1(t) \le y_j \mid \mathcal{F}_{t-1}\right\}}{1-\mathbb{P} \left\{\theta'_1(t) \le y_j \mid \mathcal{F}_{t-1} \right\}} \bm{1} \left\{J_t = 1 \right\} \right],
    \end{array}
    \label{cloud 9}
\end{equation}
where $\theta_1'(t) \sim \text{Beta} \left(\alpha'_1(t)+1, \beta'_1(t)+1 \right)$, the non-private posterior distribution for arm $1$ conditioned on $\mathcal{F}_{t-1}$.

The first inequality in (\ref{cloud 9}) links the probability of pulling a sub-optimal arm $j$ to the probability of pulling the best arm by using a lemma that we develop in Appendix.
The last inequality uses the fact that if $\overline{\mu}_{1, O_1(t-1)} \ge \widehat{\mu}_{1,O_1(t-1)}$, we have $\mathbb{P} \left\{\theta_1(t) \le y_j \mid \mathcal{F}_{t-1}  \right\} \le \mathbb{P} \left\{\theta_1'(t) \le y_j \mid \mathcal{F}_{t-1}  \right\}$, i.e., $\text{Beta} \left(\widetilde{\alpha}_1(t)+1, \widetilde{\beta}_1(t)+1 \right)$ stochastically dominates $\text{Beta} \left(\alpha'_1(t)+1, \beta'_1(t)+1 \right)$.
Since the proof now is reduced to the non-private setting, slightly modifying 
Lemmas~2.9 and 2.10 in \citep{agrawal2017near} concludes the proof.
In Appendix, we show   $\omega_2 \le O \left(\frac{\log(T)}{\Delta_j^2} \right)$.\end{proof}
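
For convenience, we spell out the bookkeeping that turns the bounds above into Theorem~\ref{DP-TS regret theorem}. This is a sketch that assumes the standard decomposition of the regret into $\Delta_j$ times the expected number of pulls of each sub-optimal arm $j$. Plugging $\omega_0 \le O(1)$, $\omega_1 \le O(1)$, and $\omega_2 \le O \left(\frac{\log(T)}{\Delta_j^2} \right)$ into (\ref{cloud 2}) gives
\begin{equation*}
\sum_{t=1}^{T} \mathbb{E} \left[\bm{1} \left\{J_t = j \right\} \right] \le \mathcal{L}_j + O \left(\frac{\log(T)}{\Delta_j^2} \right) + O(1) .
\end{equation*}
Multiplying by $\Delta_j$ and using
\begin{equation*}
\Delta_j \cdot \mathcal{L}_j = \max \left\{ \frac{108\log(T)}{\Delta_j}, \frac{72\log(T)}{\epsilon} \log \left(\frac{72\log(T)}{\epsilon \cdot \Delta_j} \right) \right\}
\end{equation*}
shows that arm $j$ contributes $O \left( \max \left\{ \frac{\log(T)}{\Delta_j}, \frac{\log(T)}{\epsilon} \log \left(\frac{\log(T)}{\epsilon \cdot \Delta_j} \right)\right\} \right)$ to the regret, up to lower-order terms.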

 %\vspace{-1\baselineskip}
 
\subsection{Lazy-DP-TS} \label{sec: lazy-dp-ts}
We now present Lazy-DP-TS and its guarantees. The key idea for achieving optimality is to use each observation at most once when computing the differentially private empirical mean.



\subsubsection{Algorithm}
\begin{algorithm}[!ht]
	\caption{Lazy-DP-TS}
	\label{Optimal DP-TS}
	\begin{algorithmic}[1]
	\STATE {\bf{Input}:} Arm set $\mathcal{A}$ and privacy parameter $\epsilon$
	%\STATE {\bf{Initialization}:} 
		\FOR {$t = 1,2, \dotsc, K$} \label{ini-start 2}
	\STATE Pull $J_t \leftarrow t$; Set $O_{J_t} \leftarrow 1 $, $\widetilde{\mu}_{J_t, O_{J_t}}  \leftarrow X_{J_t}(t)  + \text{Lap} \left( \frac{1}{\epsilon} \right)$,  $r_{J_t} \leftarrow 0$, $\Psi_{J_t} \leftarrow \{\}$
	\ENDFOR \label{ini-end 2}

	\FOR {$t = K+1, K+2, \dotsc$} \label{post 7}


\FOR {$j \in \mathcal{A}$}

\STATE Set 
$\overline{\mu}_{j, O_j} =\max \left\{  0, \min \left\{ \widetilde{\mu}_{j, O_j}  +  \frac{3\log(t)}{\epsilon \cdot O_j }, 1 \right\} \right\}$ 


\STATE Set $\widetilde{\alpha}_j \leftarrow \overline{\mu}_{j, O_j} \cdot O_j$, 
 $\widetilde{\beta}_j \leftarrow (1-\overline{\mu}_{j, O_j} )\cdot O_j$  
\STATE Sample $\theta_j(t) \sim \text{Beta}(\widetilde{\alpha}_j + 1, \widetilde{\beta}_j + 1)$ 
\ENDFOR
\STATE Pull  $J_t \in \mathop{\arg\max}_{j \in \mathcal{A}} \theta_j(t)$ \label{post 8}
\STATE Append $X_{J_t}(t)$ to  $\Psi_{J_t}$ \label{post 3}

	\IF {number of observations in $\Psi_{J_t}$ hits $2^{r_{J_t}+1}$}
	\STATE Set $O_{J_t} \leftarrow 2^{r_{J_t}+1}$, $\widetilde{\mu}_{J_t, O_{J_t}} \leftarrow \frac{\sum \Psi_{J_t} +\text{Lap} \left(\frac{1}{\epsilon}\right)}{O_{J_t}}$
	\STATE Set $r_{J_t} \leftarrow r_{J_t} + 1$,
	 $\Psi_{J_t} \leftarrow \left\{ \right\}$
	\ENDIF \label{post 4}


\ENDFOR	
\end{algorithmic}
\end{algorithm}
We first present some notation specific to this algorithm.
Let $O_j(t-1)$ denote the number of observations that are used to compute the differentially private empirical mean and $\widehat{\mu}_{j, O_j(t-1)}$ denote the empirical mean of these $O_j(t-1)$ observations. 
Let $\widetilde{\mu}_{j,O_j(t-1)}$ be the differentially private empirical mean. 


Lazy-DP-TS is presented in Algorithm~\ref{Optimal DP-TS}.
Lines~\ref{ini-start 2} to \ref{ini-end 2} are the initialization. We pull each arm once and add random noise that is drawn from $\text{Lap} \left(\frac{1}{\epsilon}\right)$ to the obtained observation to initialize the differentially private empirical mean. We still use $2^{r_j}$ to track the number of observations that have been used to compute the differentially private empirical mean for arm $j$. Initially, we set $r_j = 0$ and $\Psi_j = \{\}$ to hold future observations.

For all rounds $t \ge K+1$, we first compute
 $\overline{\mu}_{j, O_j(t-1)} =\max \left\{  0, \min \left\{ \widetilde{\mu}_{j, O_j(t-1)}  +  \frac{3\log(t)}{\epsilon \cdot O_j(t-1) }, 1 \right\} \right\}$ and  then compute  $\widetilde{\alpha}_j(t) := \overline{\mu}_{j, O_j(t-1)} \cdot O_j(t-1)$ and $\widetilde{\beta}_j(t) := \left(1- \overline{\mu}_{j, O_j(t-1)} \right) \cdot O_j(t-1)$. Next, we generate a posterior sample $\theta_j(t) \sim \text{Beta} \left(\widetilde{\alpha}_j(t) +1, \widetilde{\beta}_j(t)+1 \right)$ for each arm and pull the arm with the highest posterior sample, i.e., $J_t \in \mathop{\arg\max}_{j \in \mathcal{A}}\theta_j(t)$. 
 
 To process $X_{J_t}(t)$, we append it to $\Psi_{J_t}$.
 However, we may not update the differentially private empirical mean of the pulled arm in round $t$. We update it only when the number of observations in $\Psi_{J_t}$ hits $2^{r_{J_t}+1}$, and the updated differentially private empirical mean is based on the observations in $\Psi_{J_t}$ only, i.e., it is computed by adding a noise variable drawn from $\text{Lap}\left(\frac{1}{\epsilon} \right)$ to the sum of these fresh $2^{r_{J_t}+1}$ observations and dividing by $2^{r_{J_t}+1}$. Since all observations in $\Psi_{J_t}$ have now been used, we reset $\Psi_{J_t}$ and increment $r_{J_t}$ by one.

{\bf{Remark}.} %We have two remarks for Algorithm~\ref{Optimal DP-TS}.
(a) The number of observations used to compute the differentially private empirical mean doubles each time, i.e., $O_j(t-1) = 2^{r_j}$ for some integer $r_j \ge 0$. (b) The private empirical mean of arm $j$ always includes exactly one noise variable, which is drawn from $\text{Lap}\left(\frac{1}{\epsilon} \right)$.
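
To make the doubling schedule concrete, the following illustration traces the first few updates for a fixed arm $j$. It follows our reading of Algorithm~\ref{Optimal DP-TS}; the symbol $n$, the total number of pulls of arm $j$ so far, is introduced here only for this example. The private empirical mean is refreshed exactly when $n = 2^{m+1} - 1$ for an integer $m \ge 0$, at which point $O_j = 2^{m}$ and $r_j = m$:
\begin{equation*}
    \begin{array}{c|cccc}
    n   & 1 & 3 & 7 & 15 \\ \hline
    O_j & 1 & 2 & 4 & 8 \\
    r_j & 0 & 1 & 2 & 3
    \end{array}
\end{equation*}
Between two refreshes we have $2^{m+1} - 1 \le n \le 2^{m+2} - 2$, so $O_j = 2^m \ge \frac{n+2}{4} > \frac{n}{4}$. In other words, $O_j$ always exceeds a quarter of the number of pulls, and the noise in the private empirical mean, $\text{Lap} \left(\frac{1}{\epsilon} \right) / O_j$, has scale $\frac{1}{\epsilon \cdot O_j} < \frac{4}{\epsilon \cdot n}$.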






	
\subsubsection{Analysis}
We now present privacy and regret guarantees for Algorithm~\ref{Optimal DP-TS}.
\begin{theorem}
Algorithm~\ref{Optimal DP-TS} is $\epsilon$-differentially private.
\end{theorem}
\begin{proof}
The internal algorithm to compute the differentially private empirical mean is shown in Lines \ref{post 3} to \ref{post 4} in Algorithm~\ref{Optimal DP-TS}. Lines~\ref{post 7} to \ref{post 8} can be viewed as post-processing. Now, we show that the internal algorithm is $\epsilon$-differentially private.
Suppose reward sequences $\bm{X}$ and $\bm{X}'$ differ in round $h$, i.e., the reward vectors $X_h = \left(X_1(h), \dotsc, X_K(h) \right)$ and $X'_h = \left(X'_1(h), \dotsc, X'_K(h) \right)$ are not the same. Changing from $X_h$ to $X'_h$ can only impact arm $J_h$. Let $J_h = j$.
Since arm $j$'s differentially private means are always based on fresh observations, changing from $X_h$ to $X'_h$ can only impact the differentially private aggregated reward of arm $j$ once and by at most 1. Since we add a noise variable drawn from $\text{Lap} \left(\frac{1}{\epsilon} \right)$ to $\sum\Psi_{j}$, Theorem~3.6 of \citet{dwork2014algorithmic} implies that the internal algorithm to compute the differentially private empirical mean is $\epsilon$-differentially private.\end{proof}
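
For intuition, the Laplace-mechanism step in the proof above can be verified by a standard density-ratio calculation. This is included only as a sketch: the symbols $s$, $s'$, $z$, and $p_s$ are introduced here for illustration, and rewards are assumed to lie in $[0,1]$. Let $s$ and $s'$ denote the value of $\sum \Psi_j$ under $\bm{X}$ and $\bm{X}'$, respectively, so that $|s - s'| \le 1$. The density of $s + \text{Lap} \left(\frac{1}{\epsilon} \right)$ at a point $z$ is $p_s(z) = \frac{\epsilon}{2} \exp \left( -\epsilon |z - s| \right)$, and therefore
\begin{equation*}
    \begin{array}{ll}
    \frac{p_s(z)}{p_{s'}(z)} & = \exp \left( \epsilon \left( |z - s'| - |z - s| \right) \right) \\
    & \le \exp \left( \epsilon |s - s'| \right) \le e^{\epsilon} ,
    \end{array}
\end{equation*}
where the first inequality is the triangle inequality. Dividing the noisy sum by $O_j$, which does not depend on the rewards, is post-processing.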
\begin{theorem}
The regret $\mathcal{R}_{\text{Lazy-DP-TS}}(T) $ of Algorithm~\ref{Optimal DP-TS} is at most
$\sum\limits_{j \in \mathcal{A}: \Delta_j >0}O \left(\frac{\log(T)}{\min \left\{\epsilon, \Delta_j \right\}} \right)$\quad.
\label{Lazy-DP-TS regret theorem}
\end{theorem}

%{\color{red}ADD problem-dependent regret bound and discussion on minimax optimal}

{\textbf{Remark}.} Several remarks are in order. (a): Lazy-DP-TS is  (order)-optimal  as its regret upper bound matches the  $\Omega\left(\sum\limits_{j \in \mathcal{A}:\Delta_j >0} \frac{\log(T)}{\Delta_j} + \frac{\log(T)}{\epsilon}\right)$ regret lower bound of \citet{shariff2018differentially}. Our Lazy-DP-TS preserves the same regret guarantee as the one for Anytime-Lazy-UCB by \citet{hu2021optimal} and DP-SE by \citet{sajed2019optimal}. However, as will be shown in Section~\ref{sec: experiments}, Lazy-DP-TS has better practical performance than Anytime-Lazy-UCB and DP-SE.
Since Algorithm~\ref{Optimal DP-TS} drops observations as it learns, even if we set $\epsilon \rightarrow \infty$, the regret bound can never be asymptotically optimal.  (b): Algorithm~\ref{Optimal DP-TS} also has an $O \left(\sqrt{ KT \log (T)} + \frac{K\log(T)}{\epsilon} \right)$ problem-independent regret bound. Since the price of introducing differential privacy is 
$\Omega \left(  \frac{K\log(T)}{\epsilon}  \right)$, 
the $O\left(\frac{K\log(T)}{\epsilon} \right)$ term in this problem-independent bound cannot be improved, as it matches the lower bound on the price of introducing differential privacy. Therefore, Lazy-DP-TS is minimax optimal up to a $\sqrt{\log(T)}$ factor in both the private and the non-private settings.
The detailed proof for the  problem-independent result is deferred to Appendix.
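
To see why the per-arm term in Theorem~\ref{Lazy-DP-TS regret theorem} has the same order as the sum of a non-private term and a privacy term, the following elementary check may help; it is stated per sub-optimal arm and is not a substitute for the formal comparison with the lower bound. For any $\epsilon > 0$ and $\Delta_j > 0$,
\begin{equation*}
\frac{1}{\min \left\{\epsilon, \Delta_j \right\}} \le \frac{1}{\Delta_j} + \frac{1}{\epsilon} \le \frac{2}{\min \left\{\epsilon, \Delta_j \right\}} ,
\end{equation*}
since $\frac{1}{\min \left\{\epsilon, \Delta_j \right\}} = \max \left\{ \frac{1}{\epsilon}, \frac{1}{\Delta_j} \right\}$. Multiplying through by $\log(T)$ shows that $\frac{\log(T)}{\min \left\{\epsilon, \Delta_j \right\}}$ and $\frac{\log(T)}{\Delta_j} + \frac{\log(T)}{\epsilon}$ are within a factor of $2$ of each other.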



We now present a proof sketch for Theorem~\ref{Lazy-DP-TS regret theorem}. The detailed proof is deferred to Appendix. We still define $C_j(t-1)$ as the event that the confidence interval of the empirical mean holds and $G_j(t-1)$ as the event that the noise injected is not too much. Let $\mathcal{F}_{t-1}$ collect all the history information and set $y_j := \mu_1 - \frac{\Delta_j}{6}$. Let event $E_j^{\theta}(t) := \left\{\theta_j(t) \le y_j  \right\}$.

\begin{proof}[Proof sketch of Theorem~\ref{Lazy-DP-TS regret theorem}]
We still upper bound the expected number of pulls of a sub-optimal arm $j$. However, we cannot separate all $T$ rounds into two regimes since Algorithm~\ref{Optimal DP-TS} drops observations.  %along with  learning. 
Instead, we perform a decomposition as follows. 
\begin{equation}
    \begin{array}{ll}
  & \sum\limits_{t = 1}^{T}\mathbb{E} \left[ \bm{1} \left\{ J_t = j \right\}  \right] \\
 
    \le & \underbrace{\sum\limits_{t = 1}^{T}\mathbb{P}  \left\{J_t = j, C_j(t-1), G_j(t-1), \overline{E_j^{\theta}(t)} \right\}}_{=:\omega_1} \\
  + & \underbrace{\sum\limits_{t = 1}^{T} \mathbb{P}  \left\{J_t = j,   E_j^{\theta}(t), G_1(t-1) \right\}}_{=:\omega_2} + O(1) \quad.
    \end{array}
    \label{cloud 22}
\end{equation}
 Note that the $O(1)$ term in (\ref{cloud 22}) is an upper bound on  $\sum\limits_{t = 1}^{T} \left( \mathbb{P}   \left\{ \overline{C_j(t-1)} \right\} +  \mathbb{P} \left\{ \overline{G_j(t-1)} \right\} +  \mathbb{P} \left\{ \overline{G_1(t-1)} \right\} \right)$.

To upper bound $\omega_1$,  we let $\mathcal{L}_j := \frac{72 \cdot \log(T)}{\Delta_j \cdot \min \left\{ \epsilon, \Delta_j \right\}}$ and $d_j := \log \left(\mathcal{L}_j \right)$. Recall that for  arm $j$, the numbers of observations that are used to compute the differentially private empirical means are $2^{r_j}$ for $0 \le r_j \le \log(T)$. Let $\tau_{r_j}$ be the round such that at the end of round $\tau_{r_j}$, the learning algorithm will use $2^{r_j}$ observations to update the differentially private empirical mean for arm $j$. We separate $0 \le r_j \le \log(T)$ into two parts. The first part is when $0 \le r_j \le d_j$. Based on the definition of $\tau_{r_j}$, we know that   the total number of pulls of arm $j$ is at most $\sum\limits_{s=0}^{d_j}2^s \le O\left(\frac{  \log(T)}{\Delta_j \cdot \min \left\{ \epsilon, \Delta_j \right\}} \right)$ in all rounds up to (and including) $ \tau_{d_j}$. %(including round $ \tau_{d_j}$). 
When $d_j < r_j \le \log(T)$, we have $2^{r_j} > \mathcal{L}_j$, i.e., we have accumulated ``enough'' observations for arm $j$. For a fixed $r_j$, with high probability, the expected number of pulls of arm $j$ is at most $O(1)$ in all rounds $t \in \left\{ \tau_{ r_j}+1, \dotsc, \tau_{ r_j+1} \right\}$. Then, we know that the total expected number of pulls  is at most $O(\log(T))$ in all rounds from $\tau_{d_j}+1$ up to $T$. In Appendix, we show $\omega_1 \le O \left(\frac{\log(T)}{\Delta_j \cdot \min \left\{\epsilon, \Delta_j  \right\}} \right)$.
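
The geometric sum in the first part can be checked explicitly. For concreteness we assume base-$2$ logarithms and round $d_j$ down to an integer; both are assumptions made here for illustration. Then
\begin{equation*}
\sum\limits_{s=0}^{\left\lfloor d_j \right\rfloor} 2^s = 2^{\left\lfloor d_j \right\rfloor + 1} - 1 < 2 \cdot 2^{d_j} = 2 \mathcal{L}_j = \frac{144 \cdot \log(T)}{\Delta_j \cdot \min \left\{ \epsilon, \Delta_j \right\}} ,
\end{equation*}
which is $O\left(\frac{\log(T)}{\Delta_j \cdot \min \left\{ \epsilon, \Delta_j \right\}} \right)$ as claimed.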




The challenge still lies in upper bounding $\omega_2$. We again use the ideas shown in (\ref{cloud 9}) to reduce the proof to the non-private setting. We have 
\begin{equation}
    \begin{array}{l}
    \omega_2 \le \mathbb{E} \left[\sum\limits_{t = 1}^{T} \frac{\mathbb{P} \left\{\theta'_1(t) \le y_j \mid \mathcal{F}_{t-1}\right\}}{1-\mathbb{P} \left\{\theta'_1(t) \le y_j \mid \mathcal{F}_{t-1} \right\}} \bm{1} \left\{J_t = 1 \right\} \right]\quad.
    \end{array}
\end{equation}
However, now we cannot reuse Lemmas~2.9 and 2.10 from~\citep{agrawal2017near} directly due to the fact that the observations for arm $1$ are also dropped during the learning. 
To tackle this challenge,
we  separate all $T$ rounds into multiple intervals based on whether arm $1$'s empirical mean is updated or not. Let $\tau_{r}$ be the round such that at the end of round $\tau_{r}$, the learning algorithm will use $2^{r}$ observations for arm $1$ to update arm $1$'s empirical mean, i.e., in all rounds $t \in \left\{\tau_{r} +1, \dotsc, \tau_{r+1} \right\}$, the posterior distribution for $\theta'_1(t)$ stays the same. Then, we have
\begin{equation}
    \begin{array}{lll}
    \omega_2 &\le &\mathbb{E} \left[\sum\limits_{r = 0}^{\log(T)}\sum\limits_{t = \tau_r + 1}^{\tau_{r+1}} \frac{\mathbb{P} \left\{\theta'_1(t) \le y_j \mid \mathcal{F}_{t-1}\right\}}{\mathbb{P} \left\{\theta'_1(t) > y_j \mid \mathcal{F}_{t-1} \right\}} \bm{1} \left\{J_t = 1 \right\} \right] \\
    & = &\sum\limits_{r = 0}^{\log(T)} \mathbb{E} \left[\frac{\mathbb{P} \left\{\theta'_1(\tau_r + 1) \le y_j \mid \mathcal{F}_{\tau_r}\right\}}{\mathbb{P} \left\{\theta'_1(\tau_r + 1) > y_j \mid \mathcal{F}_{\tau_r} \right\}} \sum\limits_{t = \tau_r + 1}^{\tau_{r+1}} \bm{1} \left\{J_t = 1 \right\} \right] \\
    & \le &\sum\limits_{r = 0}^{\log(T)} 2^{r+1} \cdot \mathbb{E} \left[\frac{\mathbb{P} \left\{\theta'_1(\tau_r + 1) \le y_j \mid \mathcal{F}_{\tau_r}\right\}}{\mathbb{P} \left\{\theta'_1(\tau_r + 1) > y_j \mid \mathcal{F}_{\tau_r} \right\}}  \right] \;.
    \end{array}
\end{equation}
The last inequality uses the fact that the number of pulls for arm $1$ in all rounds $t \in \left\{\tau_r +1, \dotsc, \tau_{r+1} \right\}$ is at most $2^{r+1}$ based on the definition of $\tau_{r+1}$. Let $d_1 := \log \left(\frac{8}{\mu_1 - y_j} \right)$.  We now analyze two cases separately based on whether $0 \le r \le \left\lfloor d_1 \right\rfloor$ or $ r \ge \left\lceil d_1 \right\rceil$.
By using Lemma~2.9 of \cite{agrawal2017near} and other analysis, we have  $\sum\limits_{r=0}^{\left\lfloor d_1 \right\rfloor} 2^{r+1} \mathbb{E} \left[\frac{\mathbb{P} \left\{\theta'_1(\tau_r + 1) \le y_j \mid \mathcal{F}_{\tau_r}\right\}}{\mathbb{P} \left\{\theta'_1(\tau_r + 1) > y_j \mid \mathcal{F}_{\tau_r} \right\}}  \right]  \le O \left( \frac{1}{\Delta_j^2} \right)$ and 
      $\sum\limits_{r=\left\lceil d_1 \right\rceil}^{\log(T)} 2^{r+1} \mathbb{E} \left[\frac{\mathbb{P} \left\{\theta'_1(\tau_r + 1) \le y_j \mid \mathcal{F}_{\tau_r}\right\}}{\mathbb{P} \left\{\theta'_1(\tau_r + 1) > y_j \mid \mathcal{F}_{\tau_r} \right\}}  \right]  \le O \left( \frac{\log(T)}{\Delta_j^2} \right)$. In Appendix, we show $\omega_2 \le O \left( \frac{\log(T)}{\Delta_j^2} \right)$.\end{proof}
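
For convenience, we spell out how the bounds on $\omega_1$ and $\omega_2$ combine into Theorem~\ref{Lazy-DP-TS regret theorem}. This is a sketch that assumes the standard decomposition of the regret into $\Delta_j$ times the expected number of pulls of each sub-optimal arm $j$. From (\ref{cloud 22}),
\begin{equation*}
    \begin{array}{ll}
    \sum\limits_{t = 1}^{T}\mathbb{E} \left[ \bm{1} \left\{ J_t = j \right\}  \right] \le & O \left(\frac{\log(T)}{\Delta_j \cdot \min \left\{\epsilon, \Delta_j  \right\}} \right) \\
    & + \; O \left( \frac{\log(T)}{\Delta_j^2} \right) + O(1) .
    \end{array}
\end{equation*}
Multiplying by $\Delta_j$ gives a contribution of $O \left(\frac{\log(T)}{\min \left\{\epsilon, \Delta_j  \right\}} \right) + O \left( \frac{\log(T)}{\Delta_j} \right) + O(\Delta_j)$ for arm $j$. Since $\frac{1}{\Delta_j} \le \frac{1}{\min \left\{\epsilon, \Delta_j \right\}}$, the first term dominates, which matches the per-arm term in Theorem~\ref{Lazy-DP-TS regret theorem}.
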
\section{Experimental Results} \label{sec: experiments}

\begin{figure}[!ht]
\includegraphics[width=0.4\textwidth]{500Four.eps}
\caption{$\epsilon = 500$}
\label{eps = 500v2}
\end{figure}

\begin{figure}[!ht]
\includegraphics[width=0.4\textwidth]{1.0Three.eps}
\caption{$\epsilon = 1.0$}
\label{eps = 1.0v2}
\end{figure}

\begin{figure}[!ht]
\includegraphics[width=0.4\textwidth]{0.5Three.eps}
\caption{$\epsilon = 0.5$}
\label{eps = 0.5v2}
\end{figure}

\begin{figure}[!ht]
\includegraphics[width=0.4\textwidth]{0.25Three.eps}
\caption{$\epsilon = 0.25$}
\label{eps = 0.25}
\end{figure}

\begin{figure}[!ht]
\includegraphics[width=0.4\textwidth]{0.1Three.eps}
\caption{$\epsilon = 0.10$}
\label{eps = 0.10}
\end{figure}

 %\vspace{-0.5\baselineskip}



We compare the practical performance of DP-TS, Lazy-DP-TS, DP-SE, and Anytime-Lazy-UCB under the experimental setting used by \citet{sajed2019optimal}, i.e., we have $K = 5$ arms with mean rewards set to $0.75, 0.625, 0.5, 0.375, 0.25$ and the privacy parameter $\epsilon$ set to $0.1, 0.25, 0.5, 1.0, 500$. We set $T = 10^5$.
Figure~\ref{eps = 500v2} shows the results for $\epsilon = 500$. It is not surprising that DP-TS outperforms Lazy-DP-TS: when $\epsilon$ is very large, DP-TS is asymptotically optimal while Lazy-DP-TS can only be order-optimal. Also, as expected, Thompson Sampling-based algorithms outperform the UCB-based and elimination-style algorithms.
Figures~\ref{eps = 1.0v2} to \ref{eps = 0.10} show the results for $\epsilon = 1.0, 0.5, 0.25, 0.1$. We omit the plots of DP-TS, as its practical performance is inferior to that of the remaining three optimal algorithms when $\epsilon$ is small. The experimental results show that Lazy-DP-TS outperforms DP-SE and Anytime-Lazy-UCB in all of these settings.
More experimental results, including a comparison of private and non-private algorithms, can be found in the Appendix.









\section{Conclusion} \label{sec: conclusion}


We have presented optimal Thompson Sampling-based  algorithms for differentially private stochastic bandits, filling a gap in the literature for differentially private online learning. The ideas used  in this paper also contribute to developing optimal algorithms for other settings such as differentially private combinatorial multi-armed bandits \citep{chen2020locally}.
Note that both the UCB-based and the elimination-based algorithms are deterministic.
So far, our proposed algorithms have not exploited a feature that is unique to Thompson Sampling-based algorithms, namely the randomness inherent in the learning algorithm.
An interesting future direction is the design of optimal  private Thompson Sampling-based algorithms using the fact that a random posterior sample may provide a degree of differential privacy for free \citep{wang2015privacy,foulds2016theory}. 

\section*{Acknowledgements}
This work is  supported by Amii Post-Doctoral Fellowships.  


\bibliography{uai2022-template}
% 




\end{document}
