Bandit Learning in Matching: Unknown Preferences On Both Sides

22 Sept 2023 (modified: 11 Feb 2024) · Submitted to ICLR 2024
Primary Area: reinforcement learning
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Keywords: Bandits, Matching
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2024/AuthorGuide.
Abstract: Two-sided matching under uncertainty has recently drawn much attention due to its wide range of applications. Matching bandits model the learning process in matching markets within the multi-player multi-armed bandit framework, i.e., participants learn their preferences from the stochastic rewards they observe after being matched. Existing work on matching bandits mainly focuses on the one-sided setting (i.e., arms know their own preferences exactly) and designs algorithms that converge to a stable matching with low regret. In this paper, we consider the more general two-sided setting, in which participants on both sides must learn their preferences over the other side through repeated interactions. Specifically, we formally introduce the two-sided setting and consider the rational and general case where arms adopt "sample-efficient" strategies. Facing the challenge of unstable and unreliable feedback from arms, we design an effective algorithm that requires no restrictive assumptions such as a special preference structure or observation of the winning players. Moreover, our algorithm is the first to come with a theoretical upper bound; it achieves $O(\log T)$ regret, which is proven optimal in terms of $T$.
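To make the interaction model described in the abstract concrete, below is a minimal, hypothetical simulation of the two-sided setting: players and arms both keep empirical estimates of their unknown preferences and observe stochastic rewards only after being matched. The UCB-style proposal and acceptance rules here are illustrative assumptions standing in for generic "sample-efficient" strategies; they are not the algorithm proposed or analyzed in the paper.

```python
# Sketch of the two-sided matching-bandit interaction loop (assumed setup,
# not the paper's algorithm): N players and K arms, preferences unknown on
# both sides, rewards observed only by the matched pair.
import numpy as np

rng = np.random.default_rng(0)

N, K, T = 3, 3, 2000                        # players, arms, horizon
mu_players = rng.uniform(0.2, 0.9, (N, K))  # player i's true mean reward for arm j
mu_arms = rng.uniform(0.2, 0.9, (K, N))     # arm j's true mean reward for player i

# Empirical reward sums and pull counts kept separately by each side.
p_sum, p_cnt = np.zeros((N, K)), np.zeros((N, K))
a_sum, a_cnt = np.zeros((K, N)), np.zeros((K, N))

def ucb(s, c, t):
    """Optimistic estimate: empirical mean plus an exploration bonus."""
    mean = np.divide(s, np.maximum(c, 1))
    bonus = np.sqrt(2 * np.log(t + 1) / np.maximum(c, 1))
    return np.where(c == 0, np.inf, mean + bonus)

for t in range(T):
    # Each player proposes to the arm with the highest optimistic estimate.
    proposals = ucb(p_sum, p_cnt, t).argmax(axis=1)

    for j in range(K):
        proposers = np.flatnonzero(proposals == j)
        if proposers.size == 0:
            continue
        # The arm accepts the proposer it currently estimates highest
        # (an illustrative stand-in for a sample-efficient arm strategy).
        winner = proposers[ucb(a_sum, a_cnt, t)[j, proposers].argmax()]

        # Only the matched pair observes noisy rewards; rejected players get none.
        r_player = rng.binomial(1, mu_players[winner, j])
        r_arm = rng.binomial(1, mu_arms[j, winner])
        p_sum[winner, j] += r_player; p_cnt[winner, j] += 1
        a_sum[j, winner] += r_arm;    a_cnt[j, winner] += 1

print("player-side empirical means:\n", np.round(p_sum / np.maximum(p_cnt, 1), 2))
```

The key point the sketch illustrates is the source of difficulty in the two-sided setting: because arms are still learning, their acceptance decisions change over time, so the feedback a player receives is unstable and unreliable, which is the challenge the paper's algorithm is designed to handle.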
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors' identity.
Supplementary Material: zip
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 4905