Combinatorial Bandits for Maximum Value Reward Function under Value-Index Feedback

Published: 16 Jan 2024, Last Modified: 21 Apr 2024, ICLR 2024 poster
Keywords: Combinatorial multi-armed bandit, $k$-MAX bandit, value-index feedback, maximum reward function
TL;DR: We study a combinatorial MAB problem with a maximum-value reward function under a new, weak feedback structure (value-index feedback).
Abstract: We investigate the combinatorial multi-armed bandit problem where an action is to select $k$ arms from a set of base arms, and its reward is the maximum of the sample values of these $k$ arms, under a weak feedback structure that only returns the value and index of the arm with the maximum value. This novel feedback structure is much weaker than the semi-bandit feedback previously studied and is only slightly stronger than the full-bandit feedback, and thus it presents a new challenge for the online learning task. We propose an algorithm and derive a regret bound for instances where arm outcomes follow distributions with finite supports. Our algorithm introduces a novel concept of biased arm replacement to address the weak feedback challenge, and it achieves a distribution-dependent regret bound of $O((k/\Delta)\log(T))$ and a distribution-independent regret bound of $\tilde{O}(\sqrt{T})$, where $\Delta$ is the reward gap and $T$ is the time horizon. Notably, our regret bound is comparable to the bounds obtained under the more informative semi-bandit feedback. We demonstrate the effectiveness of our algorithm through experimental results.
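To make the feedback model in the abstract concrete, the following is a minimal Python sketch of one round of a $k$-MAX bandit with value-index feedback. It is an illustration only, not the authors' algorithm: the function names (`value_index_feedback`, `expected_max`), the representation of an arm as a `(support, probs)` pair, and the tie-breaking rule (first arm in action order wins) are all assumptions made for this example.

```python
import math
import random

def value_index_feedback(arms, action, rng):
    """Play every arm in `action` once and reveal only the winner.

    `arms` maps an arm index to (support, probs), a finite-support
    distribution (assumed representation). The learner sees the maximum
    sampled value and the index of the arm that produced it; the draws
    of the other arms stay hidden, unlike semi-bandit feedback.
    """
    draws = {}
    for i in action:
        support, probs = arms[i]
        draws[i] = rng.choices(support, weights=probs, k=1)[0]
    # Ties are broken in favor of the arm listed first (an assumption).
    winner = max(action, key=lambda i: draws[i])
    return draws[winner], winner

def expected_max(arms, action):
    """Exact expected reward E[max] of an action with independent arms.

    Uses P(max <= v) = product of the per-arm CDFs at v.
    """
    values = sorted({v for i in action for v in arms[i][0]})

    def cdf(i, v):
        support, probs = arms[i]
        return sum(p for s, p in zip(support, probs) if s <= v)

    total, prev = 0.0, 0.0
    for v in values:
        cur = math.prod(cdf(i, v) for i in action)
        total += v * (cur - prev)  # cur - prev = P(max == v)
        prev = cur
    return total

# Hypothetical instance: three base arms, choose k = 2 of them.
ARMS = {
    0: ([0.0, 1.0], [0.5, 0.5]),
    1: ([0.0, 0.5], [0.5, 0.5]),
    2: ([0.2], [1.0]),
}
```

For instance, `value_index_feedback(ARMS, (0, 1), random.Random(0))` returns a pair `(value, index)` in which `index` is either 0 or 1 and `value` is a point in that arm's support. A learner under this feedback never observes the losing arm's draw, which is why estimating each arm's distribution is harder than under semi-bandit feedback.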
Primary Area: learning theory
Submission Number: 4746