Keywords: Differential Privacy, Multi-armed Bandits, Regret Analysis, Lower Bound
Abstract: As sequential learning algorithms are increasingly deployed in real-life applications, ensuring data privacy while maintaining their utility emerges as a timely question. In this context, regret minimisation in stochastic bandits under $\epsilon$-global Differential Privacy (DP) has been widely studied.
The existing literature, however, leaves a significant gap between the best-known regret lower and upper bounds in this setting, even though they ``match in order''.
Thus, we revisit the regret lower and upper bounds for $\epsilon$-global DP bandits and improve both.
First, we prove a tighter regret lower bound involving a novel information-theoretic quantity characterising the hardness of $\epsilon$-global DP in stochastic bandits. This quantity smoothly interpolates between the Kullback–Leibler (KL) divergence and the Total Variation (TV) distance, depending on the privacy budget $\epsilon$.
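To give intuition for such an interpolation, one illustrative form (a sketch only; the symbol $\mathrm{d}_\epsilon$ and this exact expression are assumptions, not necessarily the paper's definition) is
\[
\mathrm{d}_{\epsilon}(P, Q) \;\asymp\; \min\bigl\{\, \mathrm{KL}(P \,\|\, Q),\; \epsilon \,\mathrm{TV}(P, Q) \,\bigr\},
\]
where, for large $\epsilon$ (low privacy), the KL term dominates and the classical KL-based lower bound is recovered, while for small $\epsilon$ (high privacy), the bound scales as $\epsilon\,\mathrm{TV}$, reflecting the extra cost of privacy.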
Then, we take two asymptotically optimal bandit algorithms, namely KL-UCB and IMED, and propose their DP versions using a unified blueprint: (a) running in arm-dependent phases, and (b) adding Laplace noise to achieve privacy. For Bernoulli bandits, we analyse the regrets of these algorithms and show that they asymptotically match our lower bound up to a constant arbitrarily close to 1.
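As an illustration of this blueprint (a minimal sketch, not the authors' exact DP-KLUCB/DP-IMED; all function and variable names here, such as `dp_phase_mean` and `index_fn`, are hypothetical), each time the index policy selects an arm, the arm is pulled for a whole phase that doubles its sample count, and only one Laplace-noised statistic is released per phase:

```python
import numpy as np

def dp_phase_mean(phase_rewards, epsilon, rng):
    """Release an epsilon-DP estimate of the mean reward of one phase.

    Adding Lap(1/epsilon) noise to the sum of {0,1} rewards
    (sensitivity 1) makes the released sum epsilon-DP; since each
    reward contributes to exactly one released sum, the whole
    interaction is epsilon-DP by parallel composition.
    """
    n = len(phase_rewards)
    noisy_sum = np.sum(phase_rewards) + rng.laplace(scale=1.0 / epsilon)
    # Project back onto [0, 1], the range of a Bernoulli mean.
    return min(max(noisy_sum / n, 0.0), 1.0)

def run_phased_dp_bandit(arms, index_fn, epsilon, horizon, rng):
    """Generic phased blueprint. `arms` is a list of callables returning
    Bernoulli rewards; `index_fn(mean, pulls, t)` stands in for a
    KL-UCB/IMED-style index computed from noisy means."""
    k = len(arms)
    pulls = [0] * k    # total pulls per arm
    means = [1.0] * k  # optimistic initial noisy means
    t = 0
    while t < horizon:
        # Pick the arm with the best index (computed on noisy means only).
        a = max(range(k), key=lambda i: index_fn(means[i], pulls[i], t + 1))
        # Arm-dependent phase: double arm a's sample count.
        phase_len = max(1, pulls[a])
        rewards = [arms[a]() for _ in range(phase_len)]
        means[a] = dp_phase_mean(rewards, epsilon, rng)
        pulls[a] += phase_len
        t += phase_len
    return pulls

# Usage sketch with a placeholder UCB-style index.
rng = np.random.default_rng(0)
arms = [lambda p=p: float(rng.random() < p) for p in (0.9, 0.6)]
ucb = lambda m, n, t: m + np.sqrt(np.log(t) / max(n, 1))
print(run_phased_dp_bandit(arms, ucb, epsilon=1.0, horizon=1000, rng=rng))
```

The phase structure matters for privacy: because the phases of an arm use disjoint samples and each sample feeds a single released statistic, the Laplace noise scale $1/\epsilon$ suffices without further composition overhead.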
At the core of our algorithms lies a new concentration inequality for sums of Bernoulli variables under the Laplace mechanism, i.e., a DP version of the Chernoff bound.
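For intuition, a weaker bound in the same spirit follows from a simple union bound combining Hoeffding's inequality for the Bernoulli sum with the exact Laplace tail (an illustrative derivation, not the paper's sharper inequality): for $S_n = \sum_{i=1}^{n} X_i$ with $X_i \sim \mathrm{Bernoulli}(\mu)$ i.i.d. and independent noise $Y \sim \mathrm{Lap}(1/\epsilon)$,
\[
\mathbb{P}\!\left( \left| \frac{S_n + Y}{n} - \mu \right| \geq t \right)
\;\leq\;
\mathbb{P}\!\left( |S_n - n\mu| \geq \tfrac{nt}{2} \right)
+ \mathbb{P}\!\left( |Y| \geq \tfrac{nt}{2} \right)
\;\leq\;
2 e^{-n t^2 / 2} + e^{-\epsilon n t / 2},
\]
using Hoeffding's inequality for the first term and $\mathbb{P}(|Y| \geq x) = e^{-\epsilon x}$ for the second. A genuine DP Chernoff bound treats the sum and the noise jointly rather than splitting the deviation budget, which is what tightens the resulting regret analysis.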
Finally, our numerical experiments validate that DP-KLUCB and DP-IMED achieve lower regret than the existing $\epsilon$-global DP bandit algorithms.
Supplementary Material: zip
Primary Area: Theory (e.g., control theory, learning theory, algorithmic game theory)
Submission Number: 12475