Keywords: Differential Privacy, Multi-armed Bandits, Regret Analysis, Stochastic Linear Bandits
TL;DR: We study bandits with $\epsilon$-global differential privacy. We prove regret lower bounds showing a transition in hardness between the low- and high-privacy regimes, and we propose two near-optimal algorithms with matching regret upper bounds.
Abstract: We study the problem of multi-armed bandits with ε-global Differential Privacy (DP). First, we prove minimax and problem-dependent regret lower bounds for stochastic and linear bandits that quantify the hardness of bandits with ε-global DP. These bounds suggest the existence of two hardness regimes depending on the privacy budget ε. In the high-privacy regime (small ε), the hardness depends on a coupled effect of privacy and partial information about the reward distributions. In the low-privacy regime (large ε), bandits with ε-global DP are not harder than bandits without privacy. For stochastic bandits, we further propose a generic framework to design a near-optimal ε-global DP extension of an index-based optimistic bandit algorithm. The framework consists of three ingredients: the Laplace mechanism, arm-dependent adaptive episodes, and using only the rewards collected in the last episode to compute private statistics. Specifically, we instantiate ε-global DP extensions of the UCB and KL-UCB algorithms, named AdaP-UCB and AdaP-KLUCB. AdaP-KLUCB is the first algorithm that both satisfies ε-global DP and yields a regret upper bound matching the problem-dependent lower bound up to multiplicative constants.
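The three ingredients named in the abstract can be illustrated with a minimal sketch of an AdaP-UCB-style loop. This is not the authors' algorithm: the function names `laplace_private_mean` and `adap_ucb_sketch`, the Bernoulli reward model, the doubling episode schedule (episode ℓ of an arm lasts 2^ℓ pulls), the parameter `alpha`, and the exact form of the confidence bonus are all hypothetical choices made for illustration. What the sketch does show is the structure: rewards are assumed to lie in [0, 1], so the mean of n rewards has sensitivity 1/n and is released with Laplace noise of scale 1/(εn); an arm is committed to for a whole episode; and each released mean uses only that episode's rewards, so (under these assumptions) every reward enters exactly one noisy release.

```python
import math

import numpy as np


def laplace_private_mean(rewards, epsilon, rng):
    """Release the mean of rewards in [0, 1] via the Laplace mechanism.

    Replacing one reward changes the mean by at most 1/n, so Laplace noise
    with scale 1/(epsilon * n) makes this single release epsilon-DP.
    """
    n = len(rewards)
    return float(np.mean(rewards)) + rng.laplace(0.0, 1.0 / (epsilon * n))


def adap_ucb_sketch(arm_means, horizon, epsilon, alpha=3.0, seed=0):
    """Hypothetical AdaP-UCB-style loop on Bernoulli arms; returns pull counts.

    The doubling schedule and the bonus below are illustrative assumptions,
    not the exact choices analysed in the paper.
    """
    rng = np.random.default_rng(seed)
    k = len(arm_means)
    pulls = np.zeros(k, dtype=int)      # total pulls per arm
    episodes = np.zeros(k, dtype=int)   # completed episodes per arm
    priv_mean = np.zeros(k)             # private mean from each arm's last episode
    last_len = np.zeros(k, dtype=int)   # length of each arm's last episode
    t = 0
    while t < horizon:
        if np.any(episodes == 0):
            # Play every arm for one initial episode before using indices.
            arm = int(np.argmin(episodes))
        else:
            log_t = math.log(max(t, 2))
            # Assumed bonus: a sampling term plus a privacy term that scales with 1/epsilon.
            bonus = (np.sqrt(alpha * log_t / (2 * last_len))
                     + alpha * log_t / (epsilon * last_len))
            arm = int(np.argmax(priv_mean + bonus))
        # Arm-dependent adaptive episode: the chosen arm is pulled 2**ell times.
        length = min(2 ** int(episodes[arm]), horizon - t)
        rewards = rng.binomial(1, arm_means[arm], size=length)
        # Only this episode's rewards feed the released statistic; older ones are discarded.
        priv_mean[arm] = laplace_private_mean(rewards, epsilon, rng)
        last_len[arm] = length
        episodes[arm] += 1
        pulls[arm] += length
        t += length
    return pulls
```

For example, `adap_ucb_sketch([0.9, 0.5, 0.3], horizon=20000, epsilon=1.0, seed=1)` returns per-arm pull counts summing to the horizon. Discarding earlier episodes wastes some data, but it keeps each reward's contribution to a single Laplace release instead of letting privacy loss accumulate across repeated releases of a running mean.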
Supplementary Material: zip