TL;DR: We propose a novel algorithm for a pure exploration problem with linear bandit feedback on a continuous arm set, with an instance-dependent optimality guarantee.
Abstract: This paper studies a pure exploration problem with linear bandit feedback on continuous arm sets, aiming to identify an $\epsilon$-optimal arm with high probability. Previous approaches for continuous arm sets have employed instance-independent methods due to technical challenges such as the infinite dimensionality of the space of probability measures and the non-smoothness of the objective function. This paper proposes a novel, tractable algorithm that addresses these challenges by leveraging a reparametrization of the sampling distribution and projected subgradient descent. However, this approach introduces new challenges related to the projection and reconstruction of the distribution from the reparametrization. We address these by exploiting a connection to the approximate Carath\'eodory problem. Compared to the original optimization problem on the infinite-dimensional space, our method is tractable, requiring only the solution of quadratic and fractional quadratic problems on the arm set. We establish an instance-dependent optimality guarantee for our method, and empirical results on synthetic data demonstrate its superiority over existing instance-independent baselines.
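The optimization primitive named in the abstract, projected subgradient descent over a sampling distribution, can be illustrated on a finite discretization of the arm set, where the distribution becomes a weight vector on the probability simplex. The sketch below is a generic, minimal illustration of that primitive under this discretization assumption; the toy objective and all function names are illustrative and are not the paper's actual reparametrized method.

```python
import numpy as np

def project_to_simplex(v):
    """Euclidean projection of v onto the probability simplex (sorting-based)."""
    u = np.sort(v)[::-1]
    css = np.cumsum(u) - 1.0
    ks = np.arange(1, len(v) + 1)
    rho = np.nonzero(u - css / ks > 0)[0][-1]
    theta = css[rho] / (rho + 1.0)
    return np.maximum(v - theta, 0.0)

def projected_subgradient_descent(subgrad, w0, steps=200, lr=0.1):
    """Minimize a convex, possibly non-smooth objective over the simplex
    using a decreasing step size lr / sqrt(t)."""
    w = project_to_simplex(w0)
    for t in range(1, steps + 1):
        w = project_to_simplex(w - (lr / np.sqrt(t)) * subgrad(w))
    return w

# Toy usage: minimize the non-smooth objective ||w - c||_1 over the simplex;
# the minimizer is c itself, and sign(w - c) is a valid subgradient.
c = np.array([0.5, 0.3, 0.2])
w_star = projected_subgradient_descent(lambda w: np.sign(w - c),
                                       np.array([1.0, 0.0, 0.0]))
```

On a continuous arm set the weight vector is replaced by a probability measure, which is exactly the infinite-dimensional difficulty the paper's reparametrization is designed to avoid.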
Lay Summary: The linear bandit problem is an online optimization problem involving an unknown, linear reward function. The learner gains information about this function solely through noisy evaluations. Recognizing its practical relevance, we focus on the case of a continuous arm set (action space), which can be viewed as a specialized instance of Bayesian optimization. Furthermore, we address a pure exploration setting, where the learner's primary goal is to identify a high-performing arm as quickly as possible. Devising an (asymptotically) optimal learning strategy is a significant hurdle, as it necessitates optimization over the probability space defined on the arm set, a potentially infinite-dimensional space. We present a tractable algorithm, requiring a manageable number of oracle calls, and formally demonstrate its asymptotic optimality.
Link To Code: https://github.com/takemori135/pure-exploration-on-continuous-armset
Primary Area: Theory->Online Learning and Bandits
Keywords: pure exploration, linear bandits, continuous arm set
Submission Number: 8111