Keywords: Multi-Armed Bandits, Causal Inference, Online Learning, Instrumental Variables
Abstract: The deployment of Multi-Armed Bandits (MAB) has become commonplace in many economic applications. However, the regret guarantees of even state-of-the-art linear bandit algorithms, such as Optimism in the Face of Uncertainty Linear bandit (OFUL), rest on a strong exogeneity assumption on the arm covariates. This assumption is frequently violated in economic contexts, and applying such algorithms there can lead to sub-optimal decisions. In this paper, we consider the problem of online learning in linear stochastic multi-armed bandit problems with endogenous covariates. We propose an algorithm, termed BanditIV, that uses instrumental variables to correct for the resulting endogeneity bias, and prove an $\tilde{\mathcal{O}}(k\sqrt{T})$ upper bound on its expected regret. In economic contexts, it is also important to understand how the model parameter estimates behave asymptotically. To this end, we additionally propose the $\epsilon$-BanditIV algorithm and establish its asymptotic consistency and normality while ensuring the same regret bound. Finally, we carry out extensive Monte Carlo simulations and show that BanditIV and $\epsilon$-BanditIV significantly outperform existing methods.
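The abstract does not spell out BanditIV's update rule, so the following is only a minimal sketch of the core idea it alludes to: replacing the ridge/least-squares estimate used by OFUL-style bandits with a two-stage least squares (2SLS) estimate built from instruments that are exogenous to the reward noise. The $\epsilon$-greedy exploration, the confounded data-generating process, and all names below are illustrative assumptions, not the paper's specification.

```python
import numpy as np

rng = np.random.default_rng(0)

def two_sls(Z, X, y, reg=1e-6):
    """Two-stage least squares: project X onto the instruments Z,
    then regress y on the projected covariates. `reg` is a small
    ridge term added only for numerical stability early on."""
    # Stage 1: fit X ~ Z and form the projection X_hat = Z @ B
    B = np.linalg.solve(Z.T @ Z + reg * np.eye(Z.shape[1]), Z.T @ X)
    X_hat = Z @ B
    # Stage 2: regress y on X_hat to get the bias-corrected estimate
    return np.linalg.solve(X_hat.T @ X_hat + reg * np.eye(X.shape[1]),
                           X_hat.T @ y)

# Toy endogenous linear bandit: k arms, d-dim covariates, and an
# unobserved confounder u that enters both the covariates and the
# reward noise, so plain least squares would be biased.
k, d, T, eps = 5, 3, 2000, 0.1
theta_true = rng.normal(size=d)
Zs, Xs, ys = [], [], []
theta_hat = np.zeros(d)

for t in range(T):
    u = rng.normal()                      # unobserved confounder
    Z_arms = rng.normal(size=(k, d))      # instruments (exogenous)
    X_arms = Z_arms + 0.5 * u + 0.1 * rng.normal(size=(k, d))  # endogenous
    # Epsilon-greedy choice on the 2SLS estimate (an assumed stand-in
    # for whatever exploration scheme epsilon-BanditIV actually uses).
    if t < 2 * d or rng.random() < eps:
        a = rng.integers(k)
    else:
        a = int(np.argmax(X_arms @ theta_hat))
    y = X_arms[a] @ theta_true + u + 0.1 * rng.normal()  # confounded noise
    Zs.append(Z_arms[a]); Xs.append(X_arms[a]); ys.append(y)
    theta_hat = two_sls(np.array(Zs), np.array(Xs), np.array(ys))

print("2SLS estimate:", np.round(theta_hat, 3))
print("true theta:   ", np.round(theta_true, 3))
```

In this setup an ordinary least-squares estimate would be pulled away from `theta_true` because `X_arms` and the reward noise share the confounder `u`; the instruments `Z_arms` are correlated with the covariates but not with `u`, which is what lets the 2SLS step recover a consistent estimate.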