Abstract: We address differentially private stochastic bandits. We present two (near)-optimal Thompson Sampling-based learning algorithms: DP-TS and Lazy-DP-TS. The core idea in achieving optimality is the principle of optimism in the face of uncertainty: we reshape the posterior distribution optimistically compared to non-private Thompson Sampling. Our DP-TS achieves a $\sum\limits_{j \in \mathcal{A}: \Delta_j > 0} O \left(\frac{\log(T)}{\min \left\{\epsilon, \Delta_j \right\}} \log \left(\frac{\log(T)}{\epsilon \cdot \Delta_j} \right) \right)$ regret bound, where $\mathcal{A}$ is the arm set, $\Delta_j$ is the sub-optimality gap of sub-optimal arm $j$, and $\epsilon$ is the privacy parameter. Our Lazy-DP-TS removes the extra $\log$ factor by dropping observations; its regret is $\sum\limits_{j \in \mathcal{A}: \Delta_j > 0} O \left(\frac{\log(T)}{\min \left\{\epsilon, \Delta_j \right\}} \right)$, which matches the regret lower bound. Additionally, we conduct experiments comparing the empirical performance of our proposed algorithms with existing optimal algorithms for differentially private stochastic bandits.
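To make the optimistic-posterior idea concrete, below is a minimal, illustrative sketch of one round of a generic $\epsilon$-differentially-private Thompson-Sampling-style selection rule for Bernoulli rewards. It is not the paper's DP-TS or Lazy-DP-TS: the Laplace noise scale, the Gaussian posterior, and the optimism clip are assumptions chosen only for illustration, and a faithful algorithm would additionally control how often the privatized statistics are refreshed (e.g., in epochs) so that the total privacy loss stays at $\epsilon$.

```python
# Illustrative sketch only (assumed design, NOT the paper's DP-TS/Lazy-DP-TS):
# privatize each arm's reward sum with Laplace noise, then take an optimistic
# posterior sample by clipping a Gaussian sample below the noisy mean.
import numpy as np

rng = np.random.default_rng(0)

def private_optimistic_ts_round(counts, sums, eps):
    """Select an arm from privatized statistics; rewards assumed in [0, 1]."""
    samples = np.empty(len(counts))
    for j, (n, s) in enumerate(zip(counts, sums)):
        if n == 0:
            return j  # play each arm once before posterior sampling
        # Laplace mechanism on the reward sum (sensitivity 1 per refresh).
        noisy_mean = (s + rng.laplace(scale=1.0 / eps)) / n
        # Optimistic sample: never fall below the (noisy) empirical mean.
        samples[j] = max(noisy_mean, rng.normal(noisy_mean, np.sqrt(1.0 / n)))
    return int(np.argmax(samples))

# Example: 3 arms with hypothetical pull counts and reward sums, eps = 0.5.
arm = private_optimistic_ts_round(np.array([10, 7, 12]),
                                  np.array([6.0, 2.0, 9.0]), eps=0.5)
```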