Keywords: nonparametric contextual bandits, policy learning, minimax optimality, k-nearest neighbor, adaptivity
Abstract: This paper is concerned with learning an optimal policy in a nonparametric contextual bandit from offline, and possibly adaptively collected, data. Existing methods and analyses typically rely on i.i.d. offline data and a uniform coverage condition on the behavior policy. In this work, in the spirit of the single-policy concentrability coefficient, we propose a relaxed notion of coverage that measures how well the optimal action is covered by the behavior policy in nonparametric bandits. Under this new notion, we develop a novel policy learning algorithm that combines the $k$-nearest neighbor method with the pessimism principle. The new algorithm has three notable properties. First and foremost, it achieves the minimax optimal suboptimality gap for any fixed coverage level (up to log factors). Second, this optimality is attained adaptively, without requiring prior knowledge of the coverage level of the offline data. Last but not least, it maintains these guarantees even when the offline data are adaptively collected.
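For intuition, the following is a minimal illustrative sketch (not the authors' algorithm) of how a $k$-nearest neighbor estimate can be combined with a pessimism penalty for offline policy selection. The function names, the form of the penalty, and the constant `beta` are assumptions introduced purely for illustration.

```python
# Hypothetical sketch: kNN reward estimation plus a pessimism penalty.
# For each action, the reward at a query context is estimated from the k
# nearest offline samples that took that action; actions with scarce or
# distant local data are penalized before the argmax is taken.
import numpy as np

def knn_pessimistic_policy(X, A, R, x_query, n_actions, k=10, beta=1.0):
    """Select an action at context x_query from offline data (X, A, R)."""
    scores = np.full(n_actions, -np.inf)
    for a in range(n_actions):
        idx = np.where(A == a)[0]
        if len(idx) == 0:
            continue  # action never taken by the behavior policy
        dists = np.linalg.norm(X[idx] - x_query, axis=1)
        k_a = min(k, len(idx))
        order = np.argsort(dists)[:k_a]
        mean_reward = R[idx[order]].mean()
        # Pessimism: penalize actions whose neighbors are few or far away
        # (illustrative penalty; not the paper's construction).
        penalty = beta * (np.sqrt(1.0 / k_a) + dists[order].max())
        scores[a] = mean_reward - penalty
    return int(np.argmax(scores))

# Example usage with synthetic offline bandit data
rng = np.random.default_rng(0)
X = rng.uniform(size=(500, 2))           # contexts
A = rng.integers(0, 3, size=500)         # actions from a behavior policy
R = rng.normal(X[:, 0] * (A == 1), 0.1)  # rewards
print(knn_pessimistic_policy(X, A, R, np.array([0.8, 0.5]), n_actions=3))
```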
Primary Area: learning theory
Submission Number: 21158