Active Learning for Stochastic Contextual Linear Bandits

Published: 23 Sept 2025, Last Modified: 01 Dec 2025 · ARLET · CC BY 4.0
Track: Research Track
Keywords: Contextual bandits, stochastic contextual linear bandits, active learning, sample efficiency, pure exploration, regret, experiment design
Abstract: A key goal in stochastic contextual linear bandits is to efficiently learn a near-optimal policy. Prior algorithms for this problem learn a policy by strategically sampling actions and naively (passively) sampling contexts from the underlying context distribution. However, in many practical scenarios---including online content recommendation, survey research, and clinical trials---practitioners can actively sample or recruit contexts based on prior knowledge of the context distribution. Despite this potential for _active learning_, the role of strategic context sampling in stochastic contextual linear bandits is underexplored. We propose an algorithm that learns a near-optimal policy by strategically sampling rewards of context-action pairs. We prove _instance-dependent_ theoretical guarantees demonstrating that our active context sampling strategy can improve over the minimax rate by up to a factor of $\sqrt{d}$, where $d$ is the linear dimension. We also show empirically that our algorithm reduces the number of samples needed to learn a near-optimal policy on tasks such as warfarin dose prediction and joke recommendation.
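To make the contrast between passive and active context sampling concrete, here is a minimal sketch under illustrative assumptions (it is not the paper's algorithm): a toy linear reward model in which a passive rule draws contexts from the environment, while an active, experiment-design-style rule greedily selects the context-action pair whose feature vector most increases the information in the design matrix. The feature map, context/action pools, noise level, and sampling budget are all hypothetical choices made for the example.

```python
# Illustrative sketch only: passive vs. active context-action sampling for
# estimating a linear reward parameter. Not the paper's algorithm.
import numpy as np

rng = np.random.default_rng(0)
d = 5                                   # linear dimension
contexts = rng.normal(size=(20, d))     # hypothetical pool of recruitable contexts
actions = rng.normal(size=(10, d))      # hypothetical finite action set
theta_star = rng.normal(size=d)         # unknown reward parameter

def feature(x, a):
    # Simple elementwise-product feature map for a context-action pair.
    return x * a

def estimate(budget, choose_pair):
    """Collect `budget` noisy rewards with a pair-selection rule and return
    the regularized least-squares estimate of theta_star."""
    V = np.eye(d) * 1e-3                # regularized design matrix
    b = np.zeros(d)
    for _ in range(budget):
        phi = choose_pair(V)
        reward = phi @ theta_star + rng.normal(scale=0.5)
        V += np.outer(phi, phi)
        b += reward * phi
    return np.linalg.solve(V, b)

def passive_pair(V):
    # Passive: the context arrives from the environment; the action is random.
    x = contexts[rng.integers(len(contexts))]
    a = actions[rng.integers(len(actions))]
    return feature(x, a)

def active_pair(V):
    # Active: greedily pick the (context, action) pair maximizing
    # phi^T V^{-1} phi, i.e. the largest increase in log det(V).
    V_inv = np.linalg.inv(V)
    best_phi, best_gain = None, -np.inf
    for x in contexts:
        for a in actions:
            phi = feature(x, a)
            gain = phi @ V_inv @ phi
            if gain > best_gain:
                best_phi, best_gain = phi, gain
    return best_phi

budget = 200
for name, rule in [("passive", passive_pair), ("active", active_pair)]:
    theta_hat = estimate(budget, rule)
    err = np.linalg.norm(theta_hat - theta_star)
    print(f"{name:7s} estimation error: {err:.3f}")
```

In this toy setting the active rule typically yields a smaller parameter-estimation error for the same budget, which mirrors the intuition behind actively sampling context-action pairs rather than waiting for contexts to arrive passively.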
Submission Number: 108