Keywords: online learning
Abstract: We study the combinatorial online learning prediction problem with bandit feedback in a strategic setting, where experts can strategically influence the learning algorithm’s predictions by misreporting their beliefs about a sequence of binary events. The algorithm has two learning objectives. The first is to maximize its cumulative utility over a fixed time horizon, which is equivalent to minimizing regret. The second is to ensure incentive compatibility, i.e., to guarantee that each expert's optimal strategy is to report their true beliefs about the outcome of each event. In many real applications, the learning algorithm only observes the utility corresponding to its chosen experts, which is referred to as the full-bandit setting. In this work, we present an algorithm based on mirror descent that achieves a regret of $O(T^{3/4})$ under both the full-bandit and semi-bandit feedback models, while ensuring incentive compatibility. To the best of our knowledge, this is the first algorithm that simultaneously achieves sublinear regret and incentive compatibility. To demonstrate its effectiveness, we conduct an extensive empirical evaluation of the algorithm on a synthetic dataset.
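For readers unfamiliar with mirror descent under bandit feedback, the sketch below is a minimal, hypothetical illustration only: a standard exponentiated-gradient (negative-entropy mirror map) expert learner with importance-weighted loss estimates, in the simplest single-selection, non-combinatorial variant. It is not the submission's algorithm and contains no incentive-compatibility machinery; the names `loss_fn`, `eta`, and `gamma` are assumptions chosen for illustration.

```python
import numpy as np

def bandit_mirror_descent(loss_fn, n_experts, T, eta=0.1, gamma=0.1, seed=None):
    """Illustrative mirror descent (exponentiated gradient) with bandit feedback.

    loss_fn(t, i) returns the loss in [0, 1] of expert i at round t;
    only the chosen expert's loss is observed each round.
    """
    rng = np.random.default_rng(seed)
    w = np.ones(n_experts) / n_experts        # uniform initial distribution
    total_loss = 0.0
    for t in range(T):
        # Mix with uniform exploration so every expert keeps some probability mass.
        p = (1 - gamma) * w + gamma / n_experts
        i = rng.choice(n_experts, p=p)
        loss = loss_fn(t, i)                  # only this expert's loss is revealed
        total_loss += loss
        est = np.zeros(n_experts)
        est[i] = loss / p[i]                  # unbiased importance-weighted estimate
        # Mirror-descent step with the negative-entropy mirror map
        # (multiplicative-weights update), then renormalize onto the simplex.
        w = w * np.exp(-eta * est)
        w /= w.sum()
    return total_loss, w
```

Under these assumptions the update is the classical EXP3-style scheme; the submission's method differs in handling combinatorial selection, full-bandit utilities, and truthful reporting by experts.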
Supplementary Material: pdf
Primary Area: learning theory
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 4480