**Keywords:** online learning, matching markets, multi-armed stochastic bandits

**TL;DR:** We propose a two-sided bandit algorithm for matching markets with no structural assumptions.

**Abstract:** Online learning in decentralized two-sided matching markets, where the demand side (players) competes to match with the supply side (arms), has received substantial interest because it abstracts the complex interactions in matching platforms (e.g., UpWork, TaskRabbit). However, past works \citep{liu2020competing,liu2021bandit,ucbd3,basu2021beyond,SODA} assume that the supply-side arms know their preference ranking over the demand-side players (one-sided learning), and the players aim to learn their preferences over the arms through successive interactions. Moreover, several structural (and often impractical) assumptions are usually made for theoretical tractability. For example, \cite{liu2020competing,liu2021bandit,SODA} assume that when a player and an arm are matched, the identity of the matched pair becomes common knowledge to all players, whereas \cite{ucbd3,basu2021beyond,ghosh2022decentralized} assume a serial dictatorship model (or a variant of it), in which the preference rankings of the players are uniform across all arms. In this paper, we study the \emph{first} fully decentralized two-sided learning setting, where the preference rankings over players are not known to the arms a priori, and we make no structural assumptions on the problem. We propose a multi-phase explore-then-commit type algorithm, namely Epoch-based CA-ETC (collision-avoidance explore-then-commit; \texttt{CA-ETC} in short), which requires no communication across agents (players and arms) and is hence fully decentralized.
We show that for an initial epoch length of $T_0$ and subsequent epoch lengths of $2^{l/\gamma} T_0$ (for the $l$-th epoch, with $\gamma \in (0,1)$ an input parameter to the algorithm), \texttt{CA-ETC} yields a player-optimal expected regret of $\mathcal{O}\left[T_0 \left(\frac{K \log T}{T_0 (\Delta^{(i)})^2}\right)^{1/\gamma} + T_0 \left(\frac{T}{T_0}\right)^\gamma\right]$ for the $i$-th player, where $T$ is the learning horizon, $K$ is the number of arms, and $\Delta^{(i)}$ is an appropriately defined problem gap. Furthermore, we propose several other baselines for two-sided learning in matching markets.
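As an illustration of the epoch schedule described above, the following sketch (a hypothetical helper, not taken from the paper) enumerates the geometrically growing epoch lengths $T_0, 2^{1/\gamma} T_0, 2^{2/\gamma} T_0, \dots$, truncating the final epoch so the total does not exceed the horizon $T$:

```python
import math

def epoch_lengths(T, T0, gamma):
    """Epoch schedule of the form used by CA-ETC: epoch 0 has length T0,
    and the l-th epoch (l >= 1) has length 2^(l/gamma) * T0. The last
    epoch is truncated so the lengths sum to the horizon T."""
    lengths = []
    total = 0
    l = 0
    while total < T:
        length = T0 if l == 0 else math.ceil(2 ** (l / gamma) * T0)
        length = min(length, T - total)  # truncate at the horizon
        lengths.append(length)
        total += length
        l += 1
    return lengths
```

For instance, with $T = 1000$, $T_0 = 10$, and $\gamma = 0.5$, the schedule is $10, 40, 160, 640$, followed by a truncated epoch of $150$ rounds; larger $\gamma$ yields slower epoch growth and hence more, shorter epochs.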

**Submission Number:** 34
