Bandits with Costly Reward Observations

Published: 08 May 2023, Last Modified: 26 Jun 2023, UAI 2023
Keywords: bandits, value of information, contextual bandits, upper confidence bounds
TL;DR: We provide algorithms, a regret lower bound, and experiments (synthetic and real data) for bandit problems where you need to pay a cost to observe the reward.
Abstract: Many machine learning applications rely on large datasets that are conveniently collected from existing sources or that are labeled automatically as a by-product of user actions. However, in settings such as content moderation, accurately and reliably labeled data comes at substantial cost. If a learning algorithm has to pay for reward information, for example by asking a human for feedback, how does this change the exploration/exploitation tradeoff? We study this question in the context of bandit learning. Specifically, we investigate Bandits with Costly Reward Observations, where a cost needs to be paid in order to observe the reward of the bandit's action. We show that the observation cost implies an $\Omega(c^{1/3}T^{2/3})$ lower bound on the regret. Furthermore, we develop a general non-adaptive bandit algorithm which matches this lower bound, and we present several competitive adaptive learning algorithms for both k-armed and contextual bandits.
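The abstract's $O(c^{1/3}T^{2/3})$ rate for a non-adaptive algorithm is the same order one gets from a generic explore-then-commit scheme in which the learner pays the observation cost only during a fixed-length exploration phase. The sketch below illustrates that idea for the k-armed case; it is not the paper's algorithm, and the function name, the Bernoulli reward assumption, and the choice of exploration length of order $(cT^2)^{1/3}$ are illustrative assumptions only.

```python
import numpy as np

def explore_then_commit_costly(T, arm_means, c, rng=None):
    """Illustrative explore-then-commit for k-armed bandits where observing
    a reward costs c. Generic sketch, not the paper's algorithm.

    Exploration phase: pull each arm m times, paying c per observed reward.
    Commit phase: play the empirically best arm without observing rewards.
    Choosing m of order (c * T^2)^{1/3} balances exploration cost against
    commitment error, giving regret of order c^{1/3} T^{2/3} up to constants.
    """
    rng = np.random.default_rng(rng)
    k = len(arm_means)
    m = int(np.ceil((max(c, 1e-12) * T**2) ** (1 / 3) / k))
    m = max(1, min(m, T // k))  # keep exploration within the horizon

    total_reward = 0.0
    total_cost = 0.0
    counts = np.zeros(k)
    sums = np.zeros(k)

    # Exploration: observe rewards, paying the observation cost each time.
    for a in range(k):
        for _ in range(m):
            r = rng.binomial(1, arm_means[a])  # assumed Bernoulli rewards
            sums[a] += r
            counts[a] += 1
            total_reward += r
            total_cost += c

    # Commit: play the empirically best arm; no further observations or cost.
    best = int(np.argmax(sums / np.maximum(counts, 1)))
    remaining = T - k * m
    total_reward += remaining * arm_means[best]  # expected reward, for illustration

    regret = T * max(arm_means) - total_reward + total_cost
    return best, regret
```

For example, `explore_then_commit_costly(T=10_000, arm_means=[0.3, 0.5, 0.7], c=1.0)` returns the committed arm and the cost-inclusive regret for one simulated run; the adaptive algorithms described in the paper would instead decide online when a reward observation is worth its cost.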