Keywords: Contextual bandits, stochastic contextual linear bandits, active learning, sample efficiency, pure exploration, regret, experiment design
TL;DR: We present active learning algorithms for exploration in stochastic contextual linear bandits and demonstrate improved sample efficiency.
Abstract: A key goal in stochastic contextual linear bandits is to efficiently learn a near-optimal policy. Prior algorithms for this problem learn a policy by strategically sampling actions and naively (passively) sampling contexts from the underlying context distribution. However, in many practical scenarios---including online content recommendation, survey research, and clinical trials---practitioners can actively sample or recruit contexts based on prior knowledge of the context distribution. Despite this potential for _active learning_, the role of strategic context sampling in stochastic contextual linear bandits is underexplored. We propose an algorithm that learns a near-optimal policy by strategically sampling rewards of context-action pairs. We prove _instance-dependent_ theoretical guarantees demonstrating that our active context sampling strategy can improve over the minimax rate by up to a factor of $\sqrt{d}$, where $d$ is the linear dimension. We also show empirically that our algorithm reduces the number of samples needed to learn a near-optimal policy, in tasks such as warfarin dose prediction and joke recommendation.
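To make the setting concrete, below is a minimal, hypothetical sketch (not the paper's algorithm) of active sampling of context-action pairs in a stochastic linear bandit: features $\phi(x,a)\in\mathbb{R}^d$, linear rewards with noise, and a greedy G-optimal-style rule that queries the pair least covered by the current design matrix before fitting a least-squares estimate. All problem sizes, names, and the scoring rule are illustrative assumptions.

```python
# Hypothetical illustration of active context-action sampling in a linear
# bandit; NOT the algorithm proposed in the paper.
import numpy as np

rng = np.random.default_rng(0)
d, n_contexts, n_actions, budget, noise = 5, 20, 4, 200, 0.1

# Assumed problem instance: random features phi[x, a] in R^d and an
# unknown reward parameter theta_star.
phi = rng.normal(size=(n_contexts, n_actions, d))
theta_star = rng.normal(size=d)

V = 1e-3 * np.eye(d)          # regularized design matrix
b = np.zeros(d)               # accumulated feature-weighted rewards
flat = phi.reshape(-1, d)     # all (context, action) feature vectors

for _ in range(budget):
    V_inv = np.linalg.inv(V)
    # Uncertainty of each candidate pair under the current design:
    # phi^T V^{-1} phi (a greedy G-optimal-style score).
    scores = np.einsum('id,de,ie->i', flat, V_inv, flat)
    f = flat[np.argmax(scores)]                       # most informative pair
    reward = f @ theta_star + noise * rng.normal()    # simulated feedback
    V += np.outer(f, f)
    b += reward * f

theta_hat = np.linalg.solve(V, b)
# Compare the greedy policy under the estimate with the true optimal policy.
learned = (phi @ theta_hat).argmax(axis=1)
optimal = (phi @ theta_star).argmax(axis=1)
print("policy agreement:", np.mean(learned == optimal))
```

The sketch only illustrates why strategic sampling of context-action pairs can concentrate measurements along poorly covered directions of the feature space; the paper's instance-dependent guarantees and sampling strategy are developed in the full text.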
Submission Number: 151