Abstract: We study a contextual bandit setting where the agent can request multiple data samples – corresponding to potentially different context-action pairs – simultaneously in one shot within a budget, along with access to causal side information. This new formalism provides a natural model for several real-world scenarios where parallel targeted experiments can be conducted. We propose a new algorithm that utilizes a novel entropy-like measure that we introduce. We perform multiple experiments, on both purely synthetic data and a real-world dataset, and show that our algorithm outperforms baselines in all of them. In addition, we study the sensitivity of our algorithm's performance to various aspects of the problem setting. We also show that the algorithm is sound; that is, as the budget increases, the learned policy eventually converges to an optimal policy. Further, we show a bound on its regret under additional assumptions. Finally, we study fairness implications of our methodology.
Submission Length: Regular submission (no more than 12 pages of main content)
Changes Since Last Submission: In response to the reviewers' comments, we are uploading a revised PDF. Here is a summary of the main changes in this revised PDF:
(1) We have added a new regret bound (under some additional assumptions). The bound, as well as the proof, has been added to the paper (the proof is in the appendix).
(2) We have added experiments that scale the budget B large enough so that our algorithm and the baselines all converge; they are provided in the appendix. We have also added a short discussion of this.
(3) To stay within the 12 page limit for a regular submission, we have moved the proofs to the appendix.
(4) We have added a note in the appendix describing how our algorithm was implemented.
(5) We have added intuition in the appendix for why equal allocation performs worse in Experiment 2.
(6) We have fixed various typos.
Assigned Action Editor: ~Jinwoo_Shin1
Submission Number: 1245