Abstract: We examine a $K$-armed multi-armed bandit problem with probes, in which the agent may pay a cost $c\geq 0$ to probe one arm and observe its reward before making a pull. We identify the optimal strategy for deciding whether to probe or to pull directly and, when probing, which arm to pull after observing the probe's outcome. Additionally, we introduce a novel regret definition based on the expected reward of the optimal action. We propose UCBP, a novel algorithm that utilizes this strategy. UCBP achieves a gap-independent regret upper bound over $T$ rounds that scales with $\mathcal{O}(\sqrt{KT\log T})$, and an order-optimal gap-dependent upper bound that scales with $\mathcal{O}(K\log T)$. As a baseline, we provide UCB-naive-probe, a naive UCB-based approach with a gap-independent regret upper bound on the order of $\mathcal{O}(K\sqrt{T\log T})$ and a gap-dependent bound on the order of $\mathcal{O}(K^{2}\log T)$. We provide empirical simulations to verify the utility of UCBP in practical settings, and show that UCBP outperforms UCB-naive-probe in simulations.
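To make the probe-or-pull setting concrete, the following is a minimal simulation sketch of a Bernoulli bandit with probes, using a simple UCB-style heuristic: probe the UCB-leading arm while its confidence width still exceeds the probe cost $c$, then pull either the probed arm or the empirical runner-up based on the observed probe outcome. The probe rule, the fallback choice, and the function name `ucb_probe_bandit` are assumptions for illustration, not the paper's exact UCBP rule.

```python
import math
import random

def ucb_probe_bandit(means, T, c=0.1, seed=0):
    """Simulate a K-armed Bernoulli bandit where, each round, the agent may pay
    cost c to probe one arm and observe its reward before pulling.
    Heuristic sketch (NOT the paper's exact UCBP rule).
    Returns cumulative net reward (pull rewards minus probe costs)."""
    rng = random.Random(seed)
    K = len(means)
    counts = [0] * K      # observations per arm (probes and pulls)
    sums = [0.0] * K      # total observed reward per arm

    def ucb(i, t):
        # Standard UCB1 index; unobserved arms are forced to be explored.
        if counts[i] == 0:
            return float("inf")
        return sums[i] / counts[i] + math.sqrt(2.0 * math.log(t) / counts[i])

    total = 0.0
    for t in range(1, T + 1):
        best = max(range(K), key=lambda i: ucb(i, t))
        width = (float("inf") if counts[best] == 0
                 else math.sqrt(2.0 * math.log(t) / counts[best]))
        if width > c:  # probe while uncertainty still outweighs the probe cost
            x = 1.0 if rng.random() < means[best] else 0.0  # probe outcome
            total -= c
            counts[best] += 1
            sums[best] += x
            # Fallback arm: best empirical mean among the others (prior 0.5).
            alt = max((i for i in range(K) if i != best),
                      key=lambda i: sums[i] / counts[i] if counts[i] else 0.5)
            alt_mean = sums[alt] / counts[alt] if counts[alt] else 0.5
            if x >= alt_mean:
                total += x  # pull the probed arm: its reward was just observed
            else:
                r = 1.0 if rng.random() < means[alt] else 0.0
                counts[alt] += 1
                sums[alt] += r
                total += r
        else:
            # Probe not worth the cost: pull the UCB-leading arm directly.
            r = 1.0 if rng.random() < means[best] else 0.0
            counts[best] += 1
            sums[best] += r
            total += r
    return total
```

For example, `ucb_probe_bandit([0.2, 0.8], 1000, c=0.1)` runs 1000 rounds on a two-armed instance and returns the net reward after probe costs; with a small gap-closing cost $c$, the agent probes heavily early and tapers off as confidence widths shrink below $c$.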
External IDs: dblp:conf/isit/ElumarTY24