**Abstract:**We address differentially private stochastic bandits. We present two (near)-optimal Thompson Sampling-based learning algorithms: DP-TS and Lazy-DP-TS. The core idea in achieving optimality is the principle of optimism in the face of uncertainty. We reshape the posterior distribution in an optimistic way as compared to the non-private Thompson Sampling. Our DP-TS achieves a $\sum\limits_{j \in \mathcal{A}: \Delta_j > 0} O \left(\frac{\log(T)}{\min \left\{\epsilon, \Delta_j \right\} )} \log \left(\frac{\log(T)}{\epsilon \cdot \Delta_j} \right) \right)$ regret bound, where $\mathcal{A}$ is the arm set, $\Delta_j$ is the sub-optimality gap of a sub-optimal arm $j$, and $\epsilon$ is the privacy parameter. Our Lazy-DP-TS gets rid of the extra $\log$ factor by using the idea of dropping observations. The regret of Lazy-DP-TS is $ \sum\limits_{j \in \mathcal{A}: \Delta_j > 0} O \left(\frac{\log(T)}{\min \left\{\epsilon, \Delta_j \right\}} \right)$, which matches the regret lower bound. Additionally, we conduct experiments to compare the empirical performance of our proposed algorithms with the existing optimal algorithms for differentially private stochastic bandits.

**Supplementary Material:**zip

4 Replies

Loading