Keywords: Multi-armed Bandit, Matching Markets
Abstract: Security and robustness are crucial for stable and fair transactions in two-sided markets, where participants face complex preferences and uncertain returns. In contrast to traditional competing-bandit models for two-sided markets, which focus on maximizing returns, we propose a maximum-probability-driven bandit learning (P-learning) model that emphasizes risk quantification. Since one side of the market lacks prior knowledge of its preferences for the other, the proposed P-learning algorithm maximizes the probability that a Mean-Volatility statistic lies in a preferred and attainable interval. We propose a scalable and stable matching rule that combines P-learning with the Gale-Shapley matching algorithm to ensure secure and efficient outcomes, and we present a detailed exploration-exploitation procedure for the matching algorithm supported by a centralized platform. In both the single-agent and multi-agent settings, our model achieves sublinear regret of $\mathcal{O}(\sqrt{n})$ under different conditions. We prove theoretically that P-learning yields stronger statistical power than classical tests based on normality. Simulation studies demonstrate the superiority of our algorithm over existing methods.
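The abstract's matching rule couples bandit-learned preferences with Gale-Shapley deferred acceptance. As a minimal sketch (the P-learning criterion itself is not reproduced here; in the paper's setting, each proposer's preference list would be ranked by its current probability estimates rather than given in advance), the deferred-acceptance core might look like:

```python
def gale_shapley(proposer_prefs, receiver_prefs):
    """Deferred acceptance for an equal-sized two-sided market.

    proposer_prefs[p] is proposer p's list of receivers, best first;
    receiver_prefs[r] is receiver r's list of proposers, best first.
    In the bandit setting these lists would be re-ranked each round
    from learned estimates (hypothetical usage, not the paper's exact rule).
    Returns a dict mapping each proposer to its matched receiver.
    """
    n = len(proposer_prefs)
    next_choice = [0] * n        # index of the next receiver each proposer will try
    held = {}                    # receiver -> proposer currently (tentatively) held
    free = list(range(n))        # proposers not yet tentatively matched
    # Precompute each receiver's ranking of proposers for O(1) comparisons.
    rank = [{p: i for i, p in enumerate(prefs)} for prefs in receiver_prefs]

    while free:
        p = free.pop()
        r = proposer_prefs[p][next_choice[p]]
        next_choice[p] += 1
        if r not in held:
            held[r] = p                          # receiver accepts first proposal
        elif rank[r][p] < rank[r][held[r]]:
            free.append(held[r])                 # receiver trades up; old match freed
            held[r] = p
        else:
            free.append(p)                       # proposal rejected; p tries again
    return {p: r for r, p in held.items()}
```

Proposer-optimal stability of the resulting matching is what makes the combined rule "stable" in the matching-market sense; the bandit component only supplies the preference lists.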
Primary Area: reinforcement learning
Submission Number: 18824