Keywords: Multi-armed Bandit, Matching Markets
Abstract: Security and robustness are crucial for stable and fair transactions in two-sided markets, where participants face complex preferences and uncertain returns. In contrast to traditional competing-bandit models for two-sided markets, which focus on maximizing returns, we propose a maximum-probability-driven bandit learning (P-learning) model that emphasizes risk quantification. Since one side of the market lacks prior knowledge of its preferences for the other, the proposed P-learning algorithm maximizes the probability that a Mean-Volatility statistic lies in a preferred and attainable interval. We propose a scalable and stable matching rule that combines P-learning with the Gale-Shapley matching algorithm to ensure secure and efficient outcomes, and we present a detailed exploration-exploitation procedure for the matching algorithm supported by a centralized platform. In both the single-agent and multi-agent settings, our model achieves sublinear regret of $\mathcal{O}(\sqrt{n})$ under different conditions. We prove theoretically that P-learning yields stronger statistical power than classical tests based on normality. Simulation studies demonstrate the superiority of our algorithm over existing methods.
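The abstract's matching rule couples bandit-learned preferences with Gale-Shapley deferred acceptance. As a minimal sketch (the P-learning criterion itself is not reproduced here; in the paper's setting, each proposer's preference list would be ranked by its current probability estimates rather than given in advance), the deferred-acceptance core might look like:

```python
def gale_shapley(proposer_prefs, receiver_prefs):
    """Deferred acceptance for an equal-sized two-sided market.

    proposer_prefs[p] is proposer p's list of receivers, best first;
    receiver_prefs[r] is receiver r's list of proposers, best first.
    In the bandit setting these lists would be re-ranked each round
    from learned estimates (hypothetical usage, not the paper's exact rule).
    Returns a dict mapping each proposer to its matched receiver.
    """
    n = len(proposer_prefs)
    next_choice = [0] * n        # index of the next receiver each proposer will try
    held = {}                    # receiver -> proposer currently (tentatively) held
    free = list(range(n))        # proposers not yet tentatively matched
    # Precompute each receiver's ranking of proposers for O(1) comparisons.
    rank = [{p: i for i, p in enumerate(prefs)} for prefs in receiver_prefs]

    while free:
        p = free.pop()
        r = proposer_prefs[p][next_choice[p]]
        next_choice[p] += 1
        if r not in held:
            held[r] = p                          # receiver accepts first proposal
        elif rank[r][p] < rank[r][held[r]]:
            free.append(held[r])                 # receiver trades up; old match freed
            held[r] = p
        else:
            free.append(p)                       # proposal rejected; p tries again
    return {p: r for r, p in held.items()}
```

Proposer-optimal stability of the resulting matching is what makes the combined rule "stable" in the matching-market sense; the bandit component only supplies the preference lists.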
Primary Area: reinforcement learning
Submission Number: 18824