Abstract: It is often of interest to learn a context-sensitive decision policy, such as in contextual multi-armed bandit processes. To quantify the efficiency of a machine learning algorithm in such settings, probably approximately correct (PAC) bounds, which bound the number of samples required, or cumulative regret guarantees are typically used. However, many real-world settings have a limited amount of resources for experimentation, and decisions/interventions may differ in the amount of resources they require (e.g., money or time). It is therefore of interest to design an experimentation strategy that learns a near-optimal contextual policy while minimizing the total amount of resources used. In contrast to RL or bandit approaches that embed costs into the reward function, here we focus on minimizing the resources needed during learning to obtain a policy that is near-optimal in the absence of resource constraints, analogous to PAC-style approaches that seek to minimize the amount of data needed to learn a near-optimal policy. We propose two resource-aware algorithms for the contextual bandit setting and provide finite-sample performance bounds on the best policy obtainable from each algorithm. We also evaluate both algorithms on synthetic and semi-synthetic datasets and show that they significantly reduce the total resources needed to learn a near-optimal decision policy compared to prior approaches that use resource-unaware exploration strategies.
Submission Length: Regular submission (no more than 12 pages of main content)
Changes Since Last Submission: We added a new experiment using a real-world dataset focused on mailing different types of letters to people to increase voter turnout in the 2006 Michigan primary election (see details in the “experiment_on_voting_dataset.pdf” file in the Supplementary Material).
Assigned Action Editor: ~Branislav_Kveton1
Submission Number: 2570