Keywords: nonparametric contextual bandits, policy learning, minimax optimality, k-nearest neighbor, adaptivity
Abstract: This paper is concerned with learning an optimal policy in a nonparametric contextual bandit from offline, and possibly adaptively collected, data. Existing methods and analyses typically rely on i.i.d. offline data and a uniform coverage condition on the behavior policy. In this work, in the spirit of the single-policy concentrability coefficient, we propose a relaxed notion of coverage that measures how well the optimal action is covered by the behavior policy in nonparametric bandits. Under this new notion, we develop a novel policy learning algorithm that combines the $k$-nearest neighbor method with the pessimism principle. The new algorithm has three notable properties. First and foremost, it achieves the minimax optimal suboptimality gap for any fixed coverage level (up to log factors). Second, this optimality is attained adaptively, without requiring prior knowledge of the coverage level of the offline data. Last but not least, it maintains these guarantees even when the offline data are adaptively collected.
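For intuition, the following is a minimal illustrative sketch (not the authors' algorithm) of how a $k$-nearest neighbor estimate can be combined with a pessimism penalty for offline policy selection. The function names, the form of the penalty, and the constant `beta` are assumptions introduced purely for illustration.

```python
# Hypothetical sketch: kNN reward estimation plus a pessimism penalty.
# For each action, the reward at a query context is estimated from the k
# nearest offline samples that took that action; actions with scarce or
# distant local data are penalized before the argmax is taken.
import numpy as np

def knn_pessimistic_policy(X, A, R, x_query, n_actions, k=10, beta=1.0):
    """Select an action at context x_query from offline data (X, A, R)."""
    scores = np.full(n_actions, -np.inf)
    for a in range(n_actions):
        idx = np.where(A == a)[0]
        if len(idx) == 0:
            continue  # action never taken by the behavior policy
        dists = np.linalg.norm(X[idx] - x_query, axis=1)
        k_a = min(k, len(idx))
        order = np.argsort(dists)[:k_a]
        mean_reward = R[idx[order]].mean()
        # Pessimism: penalize actions whose neighbors are few or far away
        # (illustrative penalty; not the paper's construction).
        penalty = beta * (np.sqrt(1.0 / k_a) + dists[order].max())
        scores[a] = mean_reward - penalty
    return int(np.argmax(scores))

# Example usage with synthetic offline bandit data
rng = np.random.default_rng(0)
X = rng.uniform(size=(500, 2))           # contexts
A = rng.integers(0, 3, size=500)         # actions from a behavior policy
R = rng.normal(X[:, 0] * (A == 1), 0.1)  # rewards
print(knn_pessimistic_policy(X, A, R, np.array([0.8, 0.5]), n_actions=3))
```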
Primary Area: learning theory
Submission Number: 21158