Multi-Armed Bandits With Costly Probes

Published: 01 Jan 2025 · Last Modified: 26 Sept 2025 · IEEE Trans. Inf. Theory 2025 · CC BY-SA 4.0
Abstract: The multi-armed bandit problem is a sequential decision-making problem in which an agent must choose among multiple actions to maximize its cumulative reward over time while facing uncertainty about the reward associated with each action. The challenge lies in balancing exploration of potentially higher-rewarding actions against exploitation of actions already known to yield high rewards. We consider a multi-armed bandit problem with probes, where, before pulling an arm, the decision-maker is allowed to probe one of the $K$ arms at a cost $c \geq 0$ to observe its reward. We introduce a new regret definition based on the expected reward of the optimal action. We develop UCBP, a novel algorithm that exploits this probing capability to achieve a gap-independent regret upper bound that scales with the number of rounds $T$ as $O(\sqrt{KT\log T})$, and an order-optimal gap-dependent upper bound of $O(K\log T)$. As baselines, we introduce UCB-naive-probe, a naive UCB-based approach with a gap-independent regret upper bound of $O(K\sqrt{T\log T})$ and a gap-dependent regret bound of $O(K^{2}\log T)$, and TSP, the Thompson sampling counterpart of UCBP. In empirical simulations, UCBP outperforms UCB-naive-probe and performs similarly to TSP, verifying the utility of the UCBP and TSP algorithms in practical settings.
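To make the probe-then-pull interaction concrete, the sketch below simulates one plausible decision loop under the model described in the abstract: in each round the learner pays $c$ to probe a single arm, observes that arm's realized reward, and then decides whether to collect the probed reward or pull a different arm instead. The policy shown (probe the arm with the highest UCB index, fall back to the best empirical mean otherwise) is a hypothetical illustration and not the paper's UCBP algorithm; the function name `ucb_probe_sketch`, the Bernoulli reward model, and all parameter values are assumptions made for the example.

```python
import numpy as np


def ucb_probe_sketch(means, c=0.05, T=10_000, seed=0):
    """Illustrative probe-then-pull loop (NOT the paper's UCBP).

    Each round: probe the highest-UCB arm at cost c, observe its realized
    reward, then either keep that reward or pull the best other arm by
    empirical mean. Rewards are assumed Bernoulli for simplicity.
    """
    rng = np.random.default_rng(seed)
    K = len(means)
    counts = np.zeros(K)   # observations per arm (probes + pulls)
    sums = np.zeros(K)     # running sum of observed rewards per arm
    total_reward = 0.0

    for t in range(1, T + 1):
        # UCB index; arms never observed get +inf so each is probed once.
        safe_counts = np.maximum(counts, 1)
        ucb = np.where(
            counts > 0,
            sums / safe_counts + np.sqrt(2 * np.log(t) / safe_counts),
            np.inf,
        )

        probe_arm = int(np.argmax(ucb))
        probed_reward = rng.binomial(1, means[probe_arm])  # revealed reward
        counts[probe_arm] += 1
        sums[probe_arm] += probed_reward
        total_reward -= c  # pay the probing cost

        # Fallback: best empirical mean among the remaining arms.
        emp = np.where(counts > 0, sums / np.maximum(counts, 1), 0.0)
        emp[probe_arm] = -np.inf
        alt_arm = int(np.argmax(emp))

        if probed_reward >= emp[alt_arm]:
            total_reward += probed_reward          # keep the probed reward
        else:
            pulled = rng.binomial(1, means[alt_arm])
            counts[alt_arm] += 1
            sums[alt_arm] += pulled
            total_reward += pulled                 # pull the fallback arm

    return total_reward


if __name__ == "__main__":
    print(ucb_probe_sketch(means=[0.3, 0.5, 0.7], c=0.05))
```

The sketch only illustrates how probe information can be folded into a UCB-style index; the regret bounds quoted in the abstract apply to the paper's UCBP, UCB-naive-probe, and TSP algorithms, whose precise rules are given in the paper itself.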