Player-optimal Stable Regret for Bandit Learning in Many-to-one Matching Markets with Substitutability

Yi Xu; Fang Kong; Lijun Zhang; Shuai Li

17 Sept 2025 (modified: 11 Feb 2026) | Submitted to ICLR 2026 | CC BY 4.0

Keywords: Bandits, matching markets, stable matching, many-to-one markets

Abstract: Bandit learning in matching markets, where one side of participants (players) learns unknown preferences through repeated interactions with the other side (arms), has gained increasing attention. While prior studies mainly address one-to-one settings, many real-world applications, such as online advertising and negotiation between suppliers and demanders, naturally involve many-to-one matchings. Under the widely adopted substitutability condition, which guarantees the existence of stable matchings, learning becomes more challenging: because arm preferences are complex and set-dependent, players struggle to discover opportunities to be accepted by desirable arms. Existing studies in this setting provide regret guarantees only for the player-pessimal stable matching, in which the player side receives the least favorable outcome among all stable matchings. In this work, we propose a new algorithm, RIFLE, that addresses these limitations and is tailored to substitutable many-to-one environments. It combines a randomized initialization that uncovers indexable preferences with an index-based phase that identifies explorable arms and explores them in a decentralized, conflict-free manner. We prove that RIFLE converges to the player-optimal stable matching with a cumulative regret bound of $O(\max\{N, K\} \log T / \Delta^2)$, where $N$ is the number of players and $K$ is the number of arms. This result makes two key contributions. First, our approach is more general: it operates under substitutable preferences, the most general preference condition, without requiring arm indices to be preset. Second, we derive a player-optimal stable regret bound that is currently the best known for both one-to-one and many-to-one matching markets. Empirical evaluations demonstrate that our approach significantly outperforms existing baselines in both matching quality and convergence speed.
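
The substitutability condition mentioned in the abstract can be illustrated with a small brute-force check. This is a minimal sketch of the standard textbook definition, not code from the paper: an arm's choice function is substitutable if, whenever a player is chosen from a larger candidate set, that player is still chosen from any smaller subset containing them. The helper names (`is_substitutable`, `top_q_choice`, `complementary_choice`) and the toy preferences are assumptions made for illustration only.

```python
from itertools import combinations


def subsets(items):
    """Yield every subset of `items` as a frozenset."""
    items = list(items)
    for r in range(len(items) + 1):
        for combo in combinations(items, r):
            yield frozenset(combo)


def is_substitutable(choice, players):
    """Brute-force check of substitutability for one arm.

    `choice` maps a frozenset of candidate players to the frozenset the arm
    accepts. The condition: for every small <= big, each player in `small`
    that is accepted from `big` must also be accepted from `small`.
    """
    all_sets = list(subsets(players))
    for small in all_sets:
        for big in all_sets:
            if small <= big and not (choice(big) & small) <= choice(small):
                return False
    return True


def top_q_choice(ranking, q):
    """Hypothetical responsive arm: accepts its q best-ranked candidates."""
    def choice(candidates):
        ordered = sorted(candidates, key=ranking.index)
        return frozenset(ordered[:q])
    return choice


def complementary_choice(candidates):
    """Hypothetical arm that wants 'a' and 'b' only as a pair, else 'c'."""
    if {"a", "b"} <= candidates:
        return frozenset({"a", "b"})
    return frozenset({"c"}) & candidates


players = ["a", "b", "c"]
responsive_ok = is_substitutable(top_q_choice(["a", "b", "c"], q=2), players)
complementary_ok = is_substitutable(complementary_choice, players)
```

In this toy example the capacity-based arm passes the check, while the arm that treats two players as complements fails it: "a" is accepted from {"a", "b"} but rejected when offered alone. The check enumerates all pairs of subsets, so it is only practical for very small player sets.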

Supplementary Material: zip

Primary Area: learning theory

Submission Number: 8508
