Two-sided matching under uncertainty has recently drawn much attention due to its wide applications. Matching bandits model the learning process in matching markets within the multi-player multi-armed bandit framework, i.e., participants learn their preferences from the stochastic rewards they receive after being matched. Existing work on matching bandits focuses mainly on the one-sided setting (i.e., arms know their own preferences exactly) and designs algorithms that converge to a stable matching with low regret. In this paper, we consider the more general two-sided setting, i.e., participants on both sides must learn their preferences over the other side through repeated interactions. Specifically, we formally introduce the two-sided setting and consider the rational and general case where arms adopt "sample efficient" strategies. Facing the challenge of unstable and unreliable feedback from arms, we design an effective algorithm that requires no restrictive assumptions such as a special preference structure or observation of the winning players. Moreover, our algorithm is the first to provide a theoretical upper bound, achieving $O(\log T)$ regret, which is proven optimal in terms of $T$.
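
To make the two-sided interaction concrete, the following is a minimal simulation sketch (not the paper's algorithm) of one such market: players propose, each arm accepts one proposer based on its own learned estimates, and both matched sides update from stochastic rewards. The UCB indices for players, the greedy acceptance rule standing in for a "sample efficient" arm strategy, and all names are illustrative assumptions.

```python
# Illustrative sketch of a two-sided matching-bandit interaction loop.
# Assumptions: Bernoulli rewards, UCB-style players, greedy arms (not the paper's method).
import numpy as np

rng = np.random.default_rng(0)
N_PLAYERS, N_ARMS, T = 3, 3, 2000

# Unknown ground-truth mean rewards for each side over the other side.
player_means = rng.uniform(0, 1, size=(N_PLAYERS, N_ARMS))
arm_means = rng.uniform(0, 1, size=(N_ARMS, N_PLAYERS))

# Empirical estimates and match counts maintained separately by players and arms.
player_est = np.zeros((N_PLAYERS, N_ARMS)); player_cnt = np.zeros((N_PLAYERS, N_ARMS))
arm_est = np.zeros((N_ARMS, N_PLAYERS)); arm_cnt = np.zeros((N_ARMS, N_PLAYERS))

for t in range(1, T + 1):
    # Each player proposes to the arm with the highest UCB index (untried arms first).
    bonus = np.sqrt(2 * np.log(t) / np.maximum(player_cnt, 1))
    proposals = np.argmax(np.where(player_cnt == 0, np.inf, player_est + bonus), axis=1)

    for a in range(N_ARMS):
        candidates = np.where(proposals == a)[0]
        if len(candidates) == 0:
            continue
        # The arm accepts the proposer it currently estimates highest;
        # rejected players receive no reward and no feedback this round.
        winner = candidates[np.argmax(arm_est[a, candidates])]
        # Both matched sides observe stochastic Bernoulli rewards and update their estimates.
        r_p = float(rng.random() < player_means[winner, a])
        r_a = float(rng.random() < arm_means[a, winner])
        player_cnt[winner, a] += 1
        player_est[winner, a] += (r_p - player_est[winner, a]) / player_cnt[winner, a]
        arm_cnt[a, winner] += 1
        arm_est[a, winner] += (r_a - arm_est[a, winner]) / arm_cnt[a, winner]
```

Because the arms' acceptance decisions depend on their own evolving estimates, the feedback a player receives is unstable early on, which is precisely the difficulty the two-sided setting introduces over the one-sided one.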