Keywords: Contextual Bandits, Partial Observability, Thompson Sampling, Regret Bounds
TL;DR: We developed Thompson sampling for contextual bandits with partially observed contexts.
Abstract: Contextual bandits constitute a classical framework for interactive learning of best decisions subject to context information. In this setting, the goal is to sequentially learn arms of highest reward subject to the contextual information, while the unknown reward parameters of each arm need to be learned by experimenting it. Accordingly, a fundamental problem is that of balancing such experimentation (i.e., pulling different arms to learn the parameters), versus sticking with the best arm learned so far, in order to maximize rewards. To study this problem, the existing literature mostly considers perfectly observed contexts. However, the setting of partially observed contexts remains unexplored to date, despite being theoretically more general and practically more versatile. We study bandit policies for learning to select optimal arms based on observations, which are noisy linear functions of the unobserved context vectors. Our theoretical analysis shows that adaptive experiments based on samples from the posterior distribution efficiently learn optimal arms. Specifically, we establish regret bounds that grow logarithmically with time. Extensive simulations for real-world data are presented as well to illustrate this efficacy.
Submission Number: 41
Loading