%\documentclass{uai2024} % for initial submission
\documentclass[accepted]{uai2024} % after acceptance, for a revised version; 
% also before submission to see how the non-anonymous paper would look like 
                        
%% There is a class option to choose the math font
% \documentclass[mathfont=ptmx]{uai2024} % ptmx math instead of Computer
                                         % Modern (has noticeable issues)
% \documentclass[mathfont=newtx]{uai2024} % newtx fonts (improves upon
                                          % ptmx; less tested, no support)
% NOTE: Only keep *one* line above as appropriate, as it will be replaced
%       automatically for papers to be published. Do not make any other
%       change above this note for an accepted version.

%% Choose your variant of English; be consistent
\usepackage[american]{babel}
% \usepackage[british]{babel}
%           using this macro.
%           (Use only sparingly, e.g., in drawings, as it is quite small.)

%% Self-defined macros
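% Custom formatting changes added below
%\usepackage[]{caption2} % newly added package
\renewcommand{\figurename}{Fig.} % redefine the caption label prefix
%\renewcommand{\captionlabeldelim}{.~} % redefine the label delimiter
 % \roman gives Roman numerals, \alph the default alphabetic numbering, and \arabic Arabic numerals; substitute as needed in the next line
% \renewcommand{\thesubfigure}{(\roman{subfigure})}% the subfigure label format can also be set, with or without parentheses
% \makeatletter \renewcommand{\@thesubfigure}{\thesubfigure \space}% spacing between the subfigure label and its caption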
% \renewcommand{\p@subfigure}{} \makeatother
\usepackage[utf8]{inputenc} % allow utf-8 input
\usepackage[T1]{fontenc}    % use 8-bit T1 fonts
%\usepackage{hyperref}       % hyperlinks
\usepackage{url}            % simple URL typesetting
\usepackage{booktabs}       % professional-quality tables
\usepackage{amsfonts}       % blackboard math symbols
\usepackage{nicefrac}       % compact symbols for 1/2, etc.
\usepackage{microtype}      % microtypography
\usepackage{xcolor}         % colors

\usepackage{amsmath}
\usepackage{amssymb}
\usepackage{amsthm}
\usepackage{graphicx}
\usepackage{algorithm}
%\usepackage{algorithmic}
\usepackage{algpseudocode}
\usepackage{pifont}
\usepackage{threeparttable}
\usepackage{graphics}
\usepackage{subfigure}
\usepackage{bm}
\newcommand{\argmax}{\operatorname{argmax}}
\newcommand{\argmin}{\operatorname{argmin}}
\newcommand{\E}{\mathbb{E}}
%\newcommand{\mod}{\operatorname{mod}}
\newcommand{\comment}[1]{{\color{blue}{[siwei: #1]}}}
\newtheorem{theorem}{Theorem}
\newtheorem{proposition}{Proposition}
\newtheorem{definition}{Definition}
\newtheorem{remark}{Remark}
\newtheorem{fact}{Fact}
 \newtheorem{example}{Example}
\newtheorem{lemma}{Lemma}
\newcommand{\fang}[1]{{\color{red}{[Fang: #1]}}}
\newcommand{\addcite}[1]{{\color{red}{[Add cite]}}}
\usepackage{balance} % for balancing columns on the final page
% Use the postscript times font!
%\usepackage{graphicx}
%\usepackage{algorithmic}
%% Some suggested packages, as needed:
\usepackage{natbib} % has a nice set of citation styles and commands
    \bibliographystyle{plainnat}
    \renewcommand{\bibsection}{\subsubsection*{References}}
\usepackage{mathtools} % amsmath with fixes and additions
% \usepackage{siunitx} % for proper typesetting of numbers and units
\usepackage{booktabs} % commands to create good-looking tables
\usepackage{tikz} % nice language for creating drawings and diagrams

%% Provided macros
% \smaller: Because the class footnote size is essentially LaTeX's \small,
%           redefining \footnotesize, we provide the original \footnotesize


\newcommand{\swap}[3][-]{#3#1#2} % just an example

\title{Decentralized Two-Sided Bandit Learning in Matching Market}

% The standard author block has changed for UAI 2024 to provide
% more space for long author lists and allow for complex affiliations
%
% All author information is automatically removed by the class for the
% anonymous submission version of your paper, so you can already add your
% information below.
%
% Add authors
% \author[1]{\href{mailto:<jj@example.edu>?Subject=Your UAI 2024 paper}{Jane~J.~von~O'L\'opez}{}}
\author[1]{Yirui Zhang}
\author[1,2]{Zhixuan Fang\thanks{\noindent Corresponding author: Zhixuan Fang at Tsinghua University (zfang@mail.tsinghua.edu.cn). This work is supported by Tsinghua University Dushi Program.}}
% Add affiliations after the authors
\affil[1]{%
    Institute for Interdisciplinary Information Sciences\\
    Tsinghua University\\
    Beijing, China
}
\affil[2]{%
    Shanghai Qi Zhi Institute\\
    Shanghai, China
}

  
  \begin{document}
\maketitle

\begin{abstract}

 Two-sided matching under uncertainty has recently drawn much attention due to its wide range of applications. 
 Existing work on matching bandits mainly focuses on the one-sided learning setting and designs algorithms that converge to a stable matching with low regret. In this paper, we consider the more general two-sided learning setting, i.e., participants on both sides have to learn their preferences over the other side through repeated interactions. 
 Inspired by the classical result that the optimal stable matching for the proposing side can be obtained using the Gale-Shapley algorithm, we ask whether this result still holds in a two-sided learning setting. To address this question, we formally introduce the two-sided learning setting and study strategies for both the arm and player sides without restrictive assumptions such as special preference structures or observation of winning players. 
 Our results not only provide a positive answer to this question but also offer a near-optimal upper bound, achieving $O(\log T)$ regret.%, which is proven to be optimal in terms of the time horizon $T$.
\end{abstract}
\vspace{-0.1 in}
\section{Introduction}

Stable matching with preferences on both sides is a classic problem with wide applications encompassing marriage, college admission, and labor markets. The classical literature \citep{roth1992two,roth1997turnaround,gale1962college} usually focuses on how to generate a stable outcome, i.e., how to find a stable matching in which no two participants from opposite sides prefer each other to their current partners. However, these works usually assume that every participant knows her own preferences perfectly beforehand, which may not hold in many scenarios. For example, consumers may lack knowledge about the quality of the services offered by providers, and workers may be unaware of the value of the positions offered to them. Rather than having perfect prior knowledge of their preferences, participants in real-world scenarios usually acquire information about their own utilities through repeated interactions. For instance, in online crowdsourcing platforms (Upwork and TaskRabbit) and question-answering platforms (Quora, Stack Overflow), participants engage in repeated transactions, receive stochastic rewards, and learn their preferences over time. This uncertainty in preference acquisition introduces complexities that go beyond the traditional stable matching literature and motivates the study of more realistic matching scenarios. 

%However, it seems not practical in real life. 
Recent literature on matching bandits \citep{liu2020competing,basu2021beyond,liu2021bandit,sankararaman2021dominate,maheshwari2022decentralized} has begun to explore scenarios where participants on one side learn their preferences through bandit feedback. We refer to this setting as one-sided learning in matching bandits. When individual preferences are uncertain and learned through repeated interactions, a pivotal question in matching markets is whether and how the market converges to equilibrium. In response to this question, many works in matching bandits propose algorithms that aim to reach a stable matching with low regret. However, given the absence of prior information about individual preferences and of a centralized platform for information collection, many works resort to assumptions that simplify the preference learning process. Some assume direct observation \citep{kong2023player,liu2021bandit,pokharel2023converging}, while others leverage special preference structures \citep{basu2021beyond,maheshwari2022decentralized,sankararaman2021dominate}. 

In our work, we study the more general case where both sides lack the knowledge of their own preferences, referred to as the two-sided learning setting. Importantly, we do not make assumptions regarding observations or impose special preference structures.  Based on classical results in the matching market, the optimal matching for the proposing side can be obtained using the Gale-Shapley algorithm. Our inquiry revolves around the exploration of whether this result remains applicable in the context of a two-sided learning setting. Specifically, our study involves two distinct sides of participants: the player side and the arm side. At each time slot, players simultaneously propose to arms, and arms select one proposal from all the candidates. Every participant can learn her own preference only through the rewards obtained after each match. Our objective is to ascertain whether these participants can converge to the player-optimal stable matching and, if so, how fast the convergence occurs.



Intuitively, in the two-sided learning setting, achieving a stable equilibrium appears improbable if arms consistently provide inaccurate feedback about their preferences. To avoid this situation, we make several reasonable assumptions. Given that players' decisions hinge on feedback from arms, arms must learn their preferences efficiently in order to provide valuable feedback to players.
Therefore, the learning speed of arms is the key to the problem.
 We measure the learning difficulty of the arm side by comparing it with that of the player side. In the main part of our paper, we consider a plausible scenario where the difficulty of arms' preference learning is comparable with that of players, up to a constant factor $D$. Additionally, taking into account the rationality of arms (specifically, their intention to maximize utilities), we introduce the concept of the ``rational condition'' and study the scenario where arms' strategies satisfy this condition.


As for the player-side strategies, we propose a new algorithm for the complex two-sided learning setting and provide rigorous regret analysis. 
 Our results show that the market converges to the player-optimal stable matching at a logarithmic rate. Specifically, our algorithm achieves $O(\log T/\Delta^2)$ regret with respect to the player-optimal stable matching, where $T$ is the time horizon and $\Delta$ is the minimal gap of player utilities. This regret bound is tight in terms of $T$ and $\Delta$, and it matches the state-of-the-art result in the simpler one-sided learning setting. Furthermore, the algorithm design and analysis techniques may serve as a preliminary step for future studies of two-sided learning in matching bandits.

Moreover, as an extension and a preliminary investigation of more diverse arm strategies, we consider the case where arms adopt strategic policies to collaborate with players, without the assumption on learning difficulty.

\subsection{Related Work}
The multi-armed bandit (MAB) problem is a classic and well-studied framework that models decision-making under uncertainty \citep{katehakis1987multi,auer2002finite}. A player faces $K$ arms with different utilities and aims to identify the best arm based on the stochastic reward received after each pull. Explore-then-commit (ETC) methods \citep{garivier2016explore,rosenski2016multi}, UCB-based strategies \citep{li2010contextual}, and Bayesian-type policies \citep{chapelle2011empirical,scott2010modern} are commonly used to address the trade-off between exploration and exploitation and to minimize regret.

The first work to combine the MAB framework with matching markets is \cite{das2005two}, who propose an algorithm and study it numerically under the strong assumption that each side of the market is homogeneous. \cite{liu2020competing} generalize MAB-based matching and propose basic ETC-type and UCB-type algorithms. However, \cite{liu2020competing} mainly consider the centralized setting, which is often impractical. 

Later, a line of research emerged to study decentralized matching bandits with one-sided learning. As mentioned earlier, various works make different assumptions about arm preferences. For instance, \cite{sankararaman2021dominate} analyze the scenario of globally ranked players where all arms rank players in the same order. Later, \cite{basu2021beyond} consider a more general case of uniqueness consistency and propose UCB-D4. Another specific case, $\alpha$-reducibility, is explored by \cite{maheshwari2022decentralized}. These assumptions are designed to ensure a unique stable matching.

When examining general preferences without constraints, it is common for multiple stable matchings to exist in the market. The least preferred stable matching for players is referred to as the player-pessimal stable matching, while the most preferred one is termed the player-optimal stable matching. Regret defined with respect to the player-optimal stable matching is the more desirable benchmark, since an algorithm with sublinear player-pessimal stable regret may still incur linear regret relative to the player-optimal stable matching.
With accurate knowledge of arm preferences on both the arm side and the player side, \cite{liu2021bandit} design a conflict-avoiding algorithm named CA-UCB, which upper-bounds the player-pessimal stable regret under the assumption of ``observation.'' Similarly, \cite{kong2022thompson} analyze a Thompson Sampling-based conflict-avoiding algorithm with ``observation.'' Focusing on general preferences, \cite{basu2021beyond} propose a phase-based algorithm, but its regret has an exponential dependence on $\frac{1}{\Delta}$. Adopting the assumption of ``observation,'' \cite{kong2023player} propose ETGS, which guarantees a bound on the player-optimal stable regret. ML-ETC, proposed by \cite{zhang2022matching}, is an ETC-based algorithm that applies to general preference structures and also upper-bounds the player-optimal stable regret without ``observation.''
\renewcommand\arraystretch{0.93}
\begin{table*}[!ht]
\begin{center} 
\caption{Comparison between our work and prior results.}\label{table_compare}\label{table_1}
 \begin{threeparttable}
\resizebox{0.93\linewidth}{!}
{
\begin{tabular}{c  c c }   
\hline   
\textbf{ }  &\textbf{Assumptions}&\textbf{Player-Stable Regret} \\   
\hline   \citep{liu2020competing}  &
 one-sided, centralized, known $\Delta$ &\textbf{$O(K\log T/\Delta^2)*$}\\
 \hline 
  \citep{liu2020competing}  &
 one-sided, centralized &\textbf{$O(NK^3\log T/\Delta^2)$}\\
 \hline\citep{sankararaman2021dominate}
&   one-sided, globally ranked & \textbf{$O(NK\log T/\Delta^2)$}\\
\hline\citep{basu2021beyond}
& one-sided, uniqueness consistency &  \textbf{$O(NK\log T/\Delta^2)$} \\
\hline   \citep{maheshwari2022decentralized}
&  one-sided, $\alpha$-reducibility & \textbf{$O(CNK\log T/\Delta^2)$}\\
\hline   \citep{liu2021bandit}
 & one-sided, observation & \textbf{$O(\exp{(N^4)}K^2\log^2 T/\Delta^2)$}\\
 \hline   \citep{kong2022thompson}
 & one-sided,  observation & \textbf{$O(\exp{(N^4)}K^2\log^2 T/\Delta^2)$}\\\hline   \citep{basu2021beyond}
&  one-sided & \textbf{$O(K\log ^{1+\epsilon}T/\Delta^2+\exp{(1/{\Delta^2})})*$}\\
\hline   \citep{zhang2022matching}
 & one-sided & \textbf{$O(K\log T/\Delta^2)*$}\\
 \hline   \citep{kong2023player}
 & one-sided, observation & \textbf{$O(K\log T/\Delta^2)*$}\\
 \hline \citep{pokharel2023converging}& two-sided, observation & no  theoretical results\\
 \hline \citep{pagare2023two}& two-sided & \textbf{$O(T_0(K\log T/T_0\Delta^2)^{1/\gamma}+T_0(T/T_0)^\gamma)*$}\\
\hline   this paper  & two-sided &\textbf{$O( K\log T/\Delta^2)*$} \\      
\hline   
\end{tabular}   }
\begin{tablenotes}
\item[1] $K$ is the number of arms and $N$ is the number of players.
\item[2] $*$ indicates that the regret bound is for the player-optimal stable regret.
\item[3] $C$ depends on the preferences and may grow exponentially in $N$.
\item[4] $\epsilon$ and $\gamma$ are positive hyper-parameters with $\gamma\in(0,1)$. $T_0$ is a hyper-parameter whose choice requires information about $\Delta$.
\item[5] The table labels a work ``one-sided'' or ``two-sided'' according to whether learning involves both sides of the market, rather than according to market characteristics.

\end{tablenotes}
\end{threeparttable}
\end{center}   
\end{table*}




 The literature mentioned above often assumes knowledge of arm preferences and relies on precise feedback from arms. In contrast, \cite{pokharel2023converging} address the scenario where preferences on both sides are unknown in matching bandits. They propose the PCA-DAA algorithm, which incorporates random delays to reduce the likelihood of conflicts, though their findings are currently supported only by empirical results. In a recent study, \cite{pagare2023two} introduce a multi-epoch ETC-type algorithm that achieves sub-linear regret. While the multi-epoch approach reduces regret to a sub-linear level, it tends to induce over-exploration, so the resulting regret remains significant, typically polynomial in $T$. Their algorithm also requires specifying a hyper-parameter $T_0$, whose choice depends on knowledge of the minimal gap; specifically, their approach relies on the assumption that the gap is not small. Additionally, the algorithm requires arms to employ a symmetric algorithm similar to that used by players. Other works explore two-sided learning in matching within the bandit framework from various angles. For example, \cite{jagadeesan2023learning} study matching markets under the stochastic contextual bandit model, where a platform selects a market outcome at each round with the goal of minimizing cumulative instability.
 \vspace{-0.1 in}
\section{Model}\label{setup}

Suppose there are $N$ players and $K$ arms, and denote the sets of players and arms by $\mathcal{N}$ and $\mathcal{K}$, respectively. We adopt the commonly used assumption in matching bandits that $N \le K$ \citep[e.g.,][]{liu2021bandit,kong2023player,basu2021beyond,liu2020competing}.\footnote{This assumption guarantees that each player can match with at least one arm. However, by adjusting the exploration phase, our algorithm remains effective even when $N > K$, with regret bounded by $O(N\log T/\Delta^2)$.} Neither the player side nor the arm side knows its preferences in advance. 
Specifically, each player $j$ has a fixed but unknown utility $u_{jk}$ associated with each arm $k$ and prefers arms with higher utilities. Each arm $k$ likewise has a fixed but unknown utility $u_{kj}^{a}$ associated with each player $j$ and prefers players with higher utilities (the superscript $a$ stands for ``arm'').  
Without loss of generality, we assume all utilities are within $[0,1]$, i.e., for every $j\in\mathcal{N}, k\in \mathcal{K}$, $u_{jk}, u_{kj}^a\in [0,1]$. Define the utility gap for player $j$ as $\Delta_j=\min_{k_1,k_2\in \mathcal{K}, k_1 \ne k_2}|u_{jk_1}-u_{jk_2}|$ and the utility gap for arm $k$ as $\Delta_k^a=\min_ {j_1,j_2 \in\mathcal{N}, j_1 \ne j_2} |u^a_{kj_1}-u^a_{kj_2}|$. As is common in previous work \citep[e.g.,][]{pokharel2023converging,liu2020competing,liu2021bandit}, we assume that all preferences are strict, which means that both the minimal player gap $\Delta=\min_{j\in \mathcal{N}}\Delta_j$ and the minimal arm gap $\Delta^a=\min_{k\in \mathcal{K}}\Delta_k^a$ are positive. Moreover, we consider the reasonable case where the
difficulty of arms' preference learning is comparable with that of players, up to a positive constant $D\in (0,\infty)$. Specifically, we assume $D\Delta^a\ge\Delta_j$ for all $j\in\mathcal{N}$ in the main part of the paper (except for Section \ref{EXTENSION}). 
Throughout the time horizon $T$, every player and arm learns about its own preferences through interactions and aims to match with a partner on the other side that yields higher utility. We use the notation $j_1 \succ_k j_2$ to indicate that arm $k$ prefers player $j_1$ to player $j_2$, and the similar notation $k_1 \succ_j^a k_2$ to indicate that player $j$ prefers arm $k_1$ to arm $k_2$. 

At each time step $t \le T$, each player $j$ pulls an arm $I_j(t)$ simultaneously. 
 If exactly one player pulls arm $k$, we assume that arm $k$ chooses to match with that player rather than staying unmatched, since all utilities are non-negative. When multiple players pull arm $k$, a conflict arises, and arm $k$ chooses one of the candidates to match with according to its strategy (see details in Section \ref{arm_belief}). The unchosen players are rejected and obtain no reward.
Denote the winning player on arm $k$ at time step $t$ by $A_k(t)$.
Let $C_j(t)$ represent the rejection indicator of player $j$ at time step $t$. $C_j(t)=1$ indicates that player $j$ gets rejected and $C_j(t)=0$ otherwise. 
When a match succeeds between player $j$ and arm $k$, player $j$ and arm $k$ receive stochastic rewards sampled from fixed latent $1$-subgaussian distributions with means $u_{jk}$ and $u_{kj}^{a}$, respectively. In this paper, we consider the general fully decentralized setting, i.e., no direct communication among players is allowed, and there is neither a central organizer nor external information such as observation of winning players.

\textbf{No Central Organizer or Explicit Communication.} When considering learning in matching markets, several studies \citep{min2022learn,jagadeesan2023learning} assume the existence of a central platform capable of directly determining the matching outcome. However, real-world applications typically involve participants who act individually and independently, reflecting a decentralized setting with no central organizer and no explicit communication through which players could coordinate directly. Additionally, due to privacy considerations, participants may opt out of disclosing their received rewards \citep{rees2018experimental}. For scalability reasons, decentralized solutions are also favored \citep{larsson2018law}. Notably, the majority of the related works referenced in our paper consider matching bandits from a decentralized perspective and emphasize the importance of the decentralized setting \citep{sankararaman2021dominate,liu2021bandit}.

\textbf{No Observation of Winning Players.} In the literature on matching bandits, observation of winning players (which assumes that all players can observe the winning player on every arm) is a strong but widely used assumption. Under this assumption, a player obtains information even about arms she never selects. This greatly helps players learn arms' preferences and other players' actions. For example, \cite{liu2021bandit} incorporate the observation to design a conflict-avoiding algorithm, and \cite{kong2023player} use it to help players infer others' learning progress easily. Dropping this assumption is more challenging but also more desirable. In real applications, a player is commonly informed only of her own result (acceptance or rejection) rather than of every accepted player. The assumption of no observation also captures the fully decentralized scenario, i.e., players take actions based only on their own matching histories,
without access to others' information.
%\vspace{-0.5 in}
\subsection{Regret for Both Sides}
 Before we introduce the definition of regret, we recall the definition of matching stability, which is an important issue when considering matching bandits in matching markets. 

A matching between the player side and the arm side is stable if there does not exist a (player, arm) pair such that each prefers the other to its currently matched partner. %Denote the set all the stable matchings by $M=\{m: m \text{ is stable}\}$ and 
%Let $m_j$ represent the matched pair of player $j$ in the matching $m$. 
For each player $j$, her optimal stable arm  $\overline{m}_j$ is the arm with the highest utility among her matched arms in all possible stable matchings while her pessimal stable arm $\underline {m}_j$ is  the matched arm with the lowest utility. For each arm $k$, its optimal stable player  $\overline{m}_k^a$ is the player with the highest utility among its matched players in all possible stable matchings while its pessimal stable player $\underline {m}_k^a$ is the matched player with the lowest utility. We define stable regret by comparing the utility of the matched pair with the stable pair.
The player-optimal and player-pessimal stable regret for player $j$ are defined as follows, respectively:
\begin{align*}
\overline{R}_j(T)=\mathbb{E}[\sum_{t=1}^T(u_{j\overline{m}_j}-(1-C_j(t))u_{jI_j(t)})],\\
\vspace{-0.1 in}
\underline{R}_j(T)=\mathbb{E}[\sum_{t=1}^T(u_{j\underline{m}_j}-(1-C_j(t))u_{jI_j(t)})].
\end{align*}
Similarly, the arm-optimal and arm-pessimal stable regret for arm $k$ are defined as follows, respectively:
\begin{align*}
    \overline{R}_k^a(T)=\mathbb{E}[\sum_{t=1}^T(u_{k\overline{m}_k^a}^a-u_{kA_k(t)}^a)],\\
    \vspace{-0.1 in}
\underline{R}_k^a(T)=\mathbb{E}[\sum_{t=1}^T(u_{k\underline{m}_k^a}^a-u_{kA_k(t)}^a)].
\end{align*}
Since the Gale-Shapley (GS) algorithm of \cite{gale1962college} always outputs a stable matching, at least one stable matching exists, and the above definitions are well posed.
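
\begin{example}[Illustration with hypothetical utilities]
The following toy market is only an illustration of the definitions above; its utilities are hypothetical and chosen by hand for this purpose. Let $N=K=2$ with
\begin{align*}
u_{11}=0.9,\quad u_{12}=0.1,\quad u_{21}=0.1,\quad u_{22}=0.9,\\
u^a_{11}=0.1,\quad u^a_{12}=0.9,\quad u^a_{21}=0.9,\quad u^a_{22}=0.1.
\end{align*}
Player $1$ prefers arm $1$ and player $2$ prefers arm $2$, whereas arm $1$ prefers player $2$ and arm $2$ prefers player $1$. Both perfect matchings are stable: in $\{(1,1),(2,2)\}$ (written as (player, arm) pairs) every player holds her favorite arm, and in $\{(1,2),(2,1)\}$ every arm holds its favorite player, so neither matching admits a blocking pair. Hence $\overline{m}_1=1$ and $\underline{m}_1=2$. If the market plays $\{(1,2),(2,1)\}$ in every round without rejections, then
\begin{align*}
\underline{R}_1(T)=\sum_{t=1}^T(u_{12}-u_{12})=0,\qquad
\overline{R}_1(T)=\sum_{t=1}^T(u_{11}-u_{12})=0.8\,T,
\end{align*}
so zero player-pessimal stable regret can coexist with linear player-optimal stable regret.
\end{example}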

\textbf{Player-optimal Stable Regret.} The optimal stable regret is defined with respect to the optimal stable pair, whose utility is at least that of the pessimal stable pair. Consequently, achieving sublinear optimal stable regret is more challenging and more desirable. However, a classical result of \cite{gale1962college} implies that, in general, sublinear player-optimal and arm-optimal stable regret cannot be achieved simultaneously. The same work shows that the GS algorithm yields the optimal stable matching for the proposing side. We aim to investigate whether a similar result holds in the two-sided learning setting. Therefore, in this paper, our primary focus lies on the player-optimal stable matching.  
\subsection{Arms' Strategies}\label{arm_belief}

In this section, we specify how arms choose which player to match with among the candidates. Rather than focusing on a particular strategy as in \cite{pagare2023two}, we explore the broader scenario where arms may employ diverse strategies, provided that these strategies remain aligned with the goal of maximizing their individual utilities.

If arms are aware of their preferences beforehand, they can straightforwardly select the option with the highest utility among the candidates to optimize their rewards. However, in the context of a two-sided learning setting, arms lack awareness of their own utilities and must engage in a learning process through interactions. This implies that arm $k$ can only make a selection to match one player based on past rewards received. Nevertheless, with the accumulation of samples from various players, arms' estimations of their own utilities become more accurate.  %Rational arms can then select their preferred player from the available candidates. Consequently, under the assumption of 
For arms that adopt ``rational'' strategies, once they have gathered a sufficient number of samples for each player, they choose, with high probability, the candidate with the highest utility.

We formally introduce the rational condition to describe arms' learning strategies below, and we focus on such strategies for arms in the main part of the paper (except for Section \ref{EXTENSION}). We will see later that common bandit learning algorithms (e.g., UCB and the empirical estimator) satisfy this condition, showing that this assumption on arms' behavior is not restrictive. The empirical mean of the rewards arm $k$ has received from player $j$ is denoted by $\hat{u}^a_{kj}$, and the number of times arm $k$ has been matched with player $j$ is denoted by $N_{kj}^a$. Define the event $\mathcal{E}^a=\{\forall j \in \mathcal{N}, k \in \mathcal{K}, |\hat{u}^a_{kj}-u^a_{kj}|<2\sqrt{\frac{\log T}{N^a_{kj}}}\}$. The event $\mathcal{E}^a$ states that the samples are not too unrepresentative, i.e., the empirical means are not far from the true values at every time slot. We will show in our proof that $\mathcal{E}^a$ is a high-probability event since all samples are drawn from subgaussian distributions.
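
For intuition, a standard argument for why $\mathcal{E}^a$ holds with high probability can be sketched as follows; the constants come from a generic subgaussian tail bound, and the formal proof may organize the union bound differently. For a fixed pair $(k,j)$ and a fixed number $n$ of samples, the average of $n$ independent $1$-subgaussian rewards satisfies
\begin{align*}
\mathbb{P}\Bigl(|\hat{u}^a_{kj}-u^a_{kj}|\ge 2\sqrt{\tfrac{\log T}{n}}\Bigr)\le 2\exp\Bigl(-\frac{n}{2}\cdot\frac{4\log T}{n}\Bigr)=\frac{2}{T^{2}}.
\end{align*}
Taking a union bound over the $NK$ pairs $(k,j)$ and over the possible sample counts $n\in\{1,\dots,T\}$ shows that $\mathcal{E}^a$ fails with probability at most $2NK/T$.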

 
 \begin{definition}[Arm's Rational Condition]
 We say that arm $k$ adopts a strategy satisfying the $R$ rational condition if, after collecting $R\frac{\log T}{(\Delta^a)^2}$ samples from every player and conditional on $\mathcal{E}^a$, arm $k$ chooses to match with the player with the highest utility among the candidates.
\end{definition}
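
As a direct consequence (a short calculation under the assumption $D\Delta^a\ge\Delta_j$ from Section \ref{setup}), the number of samples required by the rational condition can be expressed through the player gaps: for every $j\in\mathcal{N}$,
\begin{align*}
R\frac{\log T}{(\Delta^a)^2}\le RD^2\frac{\log T}{\Delta_j^2}\le RD^2\frac{\log T}{\Delta^2},
\end{align*}
where the last step uses $\Delta=\min_{j\in\mathcal{N}}\Delta_j\le\Delta_j$. For constant $R$ and $D$, each arm therefore needs only $O(\log T/\Delta^2)$ samples per player.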

Rational strategies not only guarantee that arms are motivated to maximize their individual utilities but also ensure that players promptly receive valuable feedback. When adopting rational strategies, arms avoid choosing suboptimal candidates for long, as long as the quality of the samples remains reasonably good. Consequently, players rarely receive inaccurate feedback. 

    As mentioned, the rational condition is not restrictive. Through simple calculation and scaling, one can see that many widely used bandit-learning techniques, including the Upper Confidence Bound (UCB) policy and the follow-the-empirical-leader policy, satisfy the rational condition with $R=16$. Moreover, our assumption on arms' strategies covers scenarios where some arms use empirical mean estimators to choose which proposal to accept, while other arms use UCB estimators to select among candidates.
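
To make the calculation concrete for the empirical-leader rule (a sketch only; the UCB case additionally depends on the chosen confidence radius, which is not fixed here), suppose $N^a_{kj}\ge 16\log T/(\Delta^a)^2$ for every $j\in\mathcal{N}$. On the event $\mathcal{E}^a$,
\begin{align*}
|\hat{u}^a_{kj}-u^a_{kj}|<2\sqrt{\frac{\log T}{N^a_{kj}}}\le 2\sqrt{\frac{(\Delta^a)^2}{16}}=\frac{\Delta^a}{2}.
\end{align*}
For any two candidates with $u^a_{kj_1}>u^a_{kj_2}$, strict preferences give $u^a_{kj_1}-u^a_{kj_2}\ge\Delta^a$, and hence
\begin{align*}
\hat{u}^a_{kj_1}>u^a_{kj_1}-\frac{\Delta^a}{2}\ge u^a_{kj_2}+\frac{\Delta^a}{2}>\hat{u}^a_{kj_2}.
\end{align*}
The candidate with the largest empirical mean is therefore the candidate with the highest true utility, which is exactly the requirement of the rational condition with $R=16$.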
\section{Round-Robin ETC Algorithm}
 In this section, we propose our algorithm for players, Round-Robin ETC, which achieves an asymptotic $O(\log T)$ player-optimal stable regret. By introducing Round-Robin ETC, we demonstrate that, even with two-sided unknown preferences, it is possible to attain the optimal stable matching for the proposing side with low regret.
\subsection{Challenges and Solutions}

In this subsection, we will discuss some unique challenges in the two-sided learning matching bandits, as well as how our proposed techniques address the challenges. Then, we give a brief introduction of the major phases in our algorithm.

 
A unique challenge brought by the two-sided learning setting lies in the asymmetry of learning ability between the two sides. Intuitively, it is harder for arms to collect enough samples, since players can choose arms proactively while arms can only passively choose one player from the candidates. Despite this asymmetry, arms must learn their preferences quickly and accurately, because this early learning ensures that players receive accurate information during conflicts. Addressing this asymmetry is therefore crucial for the overall success of learning in the two-sided setting.

Another typical challenge lies in the absence of explicit communication channels or direct observation of conflict results. 
 Reaching the correct optimal stable matching  requires cooperation between the independent players and arms given limited communication. It is challenging but crucial to let players decide on when to end their individual exploration and to start a collective matching process. 
%The absence of explicit communication channels or direct observation of conflict results poses significant challenges.
Players face difficulties in discerning the exploration progress of other players, and it becomes even more challenging to understand the exploration progress of arms, given that players can only infer this from the passive actions of arms (i.e., selecting one player from the candidates). Consequently, determining the optimal timing and method to foster cooperation and converge to stable matching poses significant challenges for those independent players.

 To tackle these challenges, we first observe that in conflict-free scenarios, i.e., when an arm is pulled by only one player, a successful match occurs and generates a clean sample for both the player and the arm. Leveraging round-robin exploration, we avoid conflicts and let the player side and the arm side learn their preferences simultaneously. Additionally, we integrate confidence bounds so that players can measure their individual exploration progress and wait for arms to accumulate sufficient samples. To estimate the learning progress of other players, we design decentralized communication through deliberate conflicts, which allows players to send and infer information. Specifically, players deliberately compete for an arm, either to send information by causing other players to be rejected or to receive information by inferring it from the rejection indicators. 
Furthermore, we carefully design the algorithm such that players can enter exploitation as soon as possible, i.e., they do not need to wait until all others have learned their preferences accurately. The intuition is that a player who wants to start exploitation only needs to make sure that every other player who could potentially ``squeeze'' her out has already entered (or is also about to enter) exploitation. 

 Building on these observations, we briefly outline our algorithm.
First, the algorithm assigns a distinct index to each player.
 Next, players perform rounds of round-robin exploration. After every round of exploration, players communicate their progress in preference learning to decide whether to start matching. If players decide to start matching, they run the Gale-Shapley (GS) algorithm and occupy their potential optimal stable arms until the end of the horizon. Otherwise, they start a new round of exploration.
\vspace{-0.2 in}
 \subsection{Round-Robin ETC}
 Algorithm \ref{algorithm1} consists of three phases: ``Index Assignment'', ``Round Robin'', and ``Exploitation''.  
 Players enter the ``Index Assignment'' phase and the ``Round Robin'' phase simultaneously but may leave the ``Round Robin'' phase for the ``Exploitation'' phase at different time steps.

In the ``Index Assignment'' phase (Line \ref{indexassign}), every player receives a distinct index. Specifically (see procedure \textit{INDEX-ASSIGNMENT}), every player keeps pulling arm $1$ until the first time step, say $t$, at which she is not rejected. She is then assigned index $t$ and moves on to pull the next arm, i.e., arm $2$. Since arm $1$ accepts only one player at each time step, all players hold distinct indices after $N$ time steps.
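
As a small illustration of the procedure, consider a hypothetical instance with $N=3$. The player names $p_a,p_b,p_c$ and the choices made by arm $1$ are assumptions made only for this illustration; the assignment may then unfold as follows:
\begin{center}
\begin{tabular}{c c c c}
\hline
step $t$ & pulling arm $1$ & accepted & index \\
\hline
$1$ & $p_a,p_b,p_c$ & $p_b$ & $p_b \mapsto 1$\\
$2$ & $p_a,p_c$ & $p_a$ & $p_a \mapsto 2$\\
$3$ & $p_c$ & $p_c$ & $p_c \mapsto 3$\\
\hline
\end{tabular}
\end{center}
Whichever candidate arm $1$ accepts, one player leaves the contest at each step and moves on to arm $2$, so the $N$ indices are distinct after $N$ steps.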



In the ``Round Robin'' phase (Lines \ref{phase2start}--\ref{phase2end}), the players proceed in rounds: they (1) explore the arms without conflict, (2) communicate their exploration progress, and (3) either start matching or update their indices and available arms. Accordingly, each round comprises three sub-phases: (1) exploration, (2) communication, and (3) update. A player leaves the ``Round Robin'' phase once she has confidently identified her optimal stable arm. She then enters the ``Exploitation'' phase and occupies her potential optimal stable arm, say arm $k$, making arm $k$ unavailable to other players. Denote the set of players that are still in the ``Round Robin'' phase by $\mathcal{N}_2$, the number of remaining players by $N_2$, the set of available arms by $\mathcal{K}_2$, and the number of available arms by $K_2$. We elaborate on the three sub-phases of ``Round Robin'' below.
 \begin{algorithm}[ht]
\caption{Round Robin ETC (for a player $j$)}\label{algorithm1}
    \begin{algorithmic}[1]
    \Statex \# Phase $1$: Index Assignment
    \State Index $\gets$ \textit{INDEX-ASSIGNMENT($N,\mathcal{K}$)}\label{indexassign}
    \Statex \# Phase $2$: Round Robin
    \State $N_2 \gets N, \mathcal{K}_2\gets \mathcal{K},K_2\gets K$ \text{ \# $N_2$ is the number of} \Statex \text{remaining players in Phase 2, $\mathcal{K}_2$ is available arms}
    \While{OPT$=\emptyset$}\label{phase2start} \Statex\text{\# when $j$ hasn't found her potential optimal stable arm}
    \Statex  \text{\#Sub-Phase: Exploration }
    \State \resizebox{0.98\linewidth}{!}{(Success$,\hat{\boldsymbol{u}}_j,\boldsymbol{N}_j$) $\gets$ \textit{EXPLORATION}(Index, $K,K_2,\mathcal{K}_2,\hat{\boldsymbol{u}}_j,\boldsymbol{N}_j$)}\label{exploration}\label{success1}
    \Statex  \text{\#Sub-Phase: Communication }
    \State Success $\gets$ \textit{COMM}(\text{Index, Success}, $N_2,K_2,\mathcal{K}_2$)
    \label{success2}
    \label{commend}
    
    \Statex \#Sub-Phase:  Update 
    \State OPT $\gets$ \textit{GALE-SHAPLEY},$N_1\gets N_2, \mathcal{K}_1\gets \mathcal{K}_2$%$(\text{Success},N_2,\mathcal{K}_2,\hat{\boldsymbol{u}}_j,\boldsymbol{N}_j)$
    \label{updatestart}
    \Statex\text{\# $N_1,\mathcal{K}_1$ are temporary parameters to help update $N_2,\mathcal{K}_2$}
    \If{Success$=1$}\label{success3} \textbf{Break while}\Statex\text{\#successful players will enter the exploitation phase}
    \EndIf
    \For{$t=1,...,N_2K_2$}\#check arms' availability
    \If {$t\hspace{-0.04 in}=\hspace{-0.04 in}(\text{Index}\hspace{-0.05 in}-\hspace{-0.04 in}1)K_2\hspace{-0.04 in}+\hspace{-0.04 in}m$} %{\small\#check arms' availability}
    \label{UPT}
    \State Pull arm $k$ that is $m$-th arm in $\mathcal{K}_2$
    \If {$C_j\hspace{-0.03 in}=\hspace{-0.03 in}1$} $\mathcal{K}_1\gets \mathcal{K}_1\setminus \hspace{-0.02 in}\{k\}, N_1\hspace{-0.02 in}\gets N_1\hspace{-0.02 in}-\hspace{-0.02 in}1$
    \EndIf
    \EndIf
    \EndFor
    \State $N_2\gets N_1, \mathcal{K}_2\gets\mathcal{K}_1$ \#update available arms and number of players
    \State Index $\gets$ \textit{INDEX-ASSIGNMENT($N_2,\mathcal{K}_2$)}\label{updateend}
    \EndWhile\label{phase2end}
    \Statex \# Phase $3$: Exploitation
    \State Pull OPT arm\label{exploitation}
    \end{algorithmic}
    %\caption{\textit{INDEX-ASSIGNMENT} (for player $j$) }\label{index}
        \hrulefill
        \begin{algorithmic}[1]
    %\Require $N,\mathcal{K}$
    \Statex \textbf{procedure }\textit {INDEX-ASSIGNMENT}$(N,\mathcal{K})$
        \State $\pi \gets \mathcal{K}[1]$
    \For {$t=1,2,...,N$} 
    \State Pull arm $\pi$
    \If {$C_j=0$, $\pi=\mathcal{K}[1]$}
     Index $\gets t$, $\pi \gets \mathcal{K}[2]$
    \EndIf
    \EndFor
    \\\Return Index
    \end{algorithmic}
\end{algorithm}
\vspace{-0.2 in}
\begin{enumerate}[leftmargin=*]
    \item Exploration (Line \ref{exploration}, see Algorithm \ref{al_exploration} \textit{EXPLORATION}). Every player explores the available arms in an order determined by her index, and every exploration lasts for $K_2K^2\lceil \log T\rceil$ time steps. Because indices are distinct and $K\ge N$, each arm is pulled by at most one player at each time step during the exploration, which prevents conflicts. Player $j$ updates her empirical means $\hat{\boldsymbol{u}}_j$ and match counts $\boldsymbol{N}_j$ throughout the exploration. To measure her progress in preference learning, player $j$ uses confidence bounds. The upper confidence bound ``UCB'' and lower confidence bound ``LCB'' are defined as follows: 
    \begin{equation}\label{UCB}
        \text{UCB}_{jk}=\hat{u}_{jk}+c\sqrt{\frac{\log T}{N_{jk}}} \text{, LCB}_{jk}=\hat{u}_{jk}-c\sqrt{\frac{\log T}{N_{jk}}},\nonumber
    \end{equation}where $\hat{u}_{jk}$ denotes the empirical mean and $N_{jk}$ denotes the number of times player $j$ has been matched with arm $k$.  Let $c=\max\{2,\frac{\sqrt{R}D}{2}+1\}$. %$c=\max\{2D+1,\frac{\sqrt{R}D}{2}+1,\frac{\sqrt{R}}{2}+1,3\}$.
    We say that player $j$ achieves a confident estimation on the arm set $\mathcal{K}^*$ if, for every $k_1, k_2 \in \mathcal{K}^*$ with $k_1 \ne k_2$, either $\text{UCB}_{jk_1} <\text{LCB}_{jk_2}$ or $\text{LCB}_{jk_1} >\text{UCB}_{jk_2}$ holds.

    \begin{algorithm}[tb]
    \caption{\textit{EXPLORATION} (for player $j$)}\label{al_exploration}
    \begin{algorithmic}[1]
    \Require Index, $K_1,K,\mathcal{K},\hat{\boldsymbol{u}}_j,\boldsymbol{N}_j$
    \State Success $\gets 0$
     \For{$t=1,2,...,KK_1^2\lceil \log T\rceil$} 
    \State Pull $(\text{Index}+t)\hspace{-0.1 in}\mod K=m$-th arm in $\mathcal{K}$ and update $\hat{u}_{jk}$, $N_{jk}$
    \EndFor
    %\State Compute the LCB$_{jk}$ and UCB$_{jk}$ for $k \in \mathcal{K}_2$
    \If {\resizebox{1\linewidth}{!}{$\forall k_1\ne k_2\in \mathcal{K}$, $\text{UCB}_{jk_1}<\text{LCB}_{jk_2}$ or $\text{LCB}_{jk_1}>\text{UCB}_{jk_2}$}} 
     Success $\gets 1$ \Statex\text{\# whether the player achieves a confident estimation}%\label{success1}
    \EndIf
    \\\Return Success, $\hat{\boldsymbol{u}}_j,\boldsymbol{N}_j$
    \end{algorithmic}
\end{algorithm}
    \begin{algorithm}[tb]
    \caption{\textit{COMM} (for player $j$)}\label{COMM}
    \begin{algorithmic}[1]
        \Require \text{Index, Success}, $N,K,\mathcal{K}$
        \For{$i=1,2,...,N$,$\text{t\_index}=1,2,...,N$, $\text{r\_index}=1,2,...,N$, r\_index$\ne$t\_index}\Statex \text{ \# player with t\_index is transmitter,  r\_index is receiver}\label{comm_player}
        \For{$m=1,2,...,K$}\label{comm_arm}\Statex\text{ \# communication process is through conflicts on $m$-th arm}
        \If{Index$=$t\_index \# if transmitter} 
          \State Pull the $m$-th arm in $\mathcal{K}$ if Success$=0$
        \EndIf
        \If {Index$=$r\_index \# if receiver} 
        \State Pull $m$-th arm in $\mathcal{K}$,  Success$\gets 0$ if $C_j=1$
        % \If{$C_j=1$} Success$\gets 0$
        % \EndIf
        \EndIf
        \EndFor
        \EndFor
        \\\Return Success
    \end{algorithmic}
\end{algorithm}


    \item Communication (Line \ref{commend}, see Algorithm \ref{COMM} \textit{COMM}). The players will communicate through deliberate conflicts in an index-based order.  This sub-phase lets players communicate on their progress of exploration. Specifically, they will communicate whether they have achieved confident estimations and the communication proceeds pairwise following the order of the index. The player with index $1$ will first serve as a transmitter, sending information to the player with index $2$, then to players with index $3$, $4$ and so on. After the player with index $1$ has finished sending information to others, the player with index $2$ will be the transmitter, then the player with index $3$ and so on. The player who wants to receive information is the receiver. 
    
    The communication sub-phase conducts pairwise communication between all pairs of remaining players on all available arms, repeated $N_2$ times. That is, for every pair of distinct remaining players $j_1$ and $j_2$, communication occurs on every available arm $N_2$ times. Every communication is conducted through a deliberate conflict on a communication arm of $\mathcal{K}_2$ between a transmitter and a receiver. The player with index ``t\_index'', denoted as player $j_1$, serves as the transmitter, and the player with index ``r\_index'', denoted as player $j_2$, is the receiver (Line \ref{comm_player} in Algorithm \ref{COMM}). Let arm $k$ be the communication arm, i.e., the $m$-th arm of $\mathcal{K}_2$ (Line \ref{comm_arm} in Algorithm \ref{COMM}). The receiver $j_2$ pulls arm $k$ to receive information. The transmitter $j_1$ pulls arm $k$ only if she has failed to achieve a confident estimation or has been rejected when receiving others' information in earlier time steps of the communication sub-phase. All other players pull an arbitrary arm $k' \neq k$.
    
    If a player achieves a confident estimation and is never rejected when receiving others' information during the communication sub-phase, we say that the player obtains successful learning. If a player obtains successful learning, then with high probability all remaining players that may ``squeeze'' her out on the available arms have achieved confident estimations (and have themselves obtained successful learning). We use ``Success'' in the pseudocode (Lines \ref{success1}, \ref{success2}, \ref{success3}) to denote the success signal: Success$=1$ indicates that the player obtains successful learning, and Success$=0$ indicates otherwise. We call the players who obtain successful learning successful players and all others unsuccessful players. 
    
    \item Update (Lines \ref{updatestart}--\ref{updateend}). Successful players identify their potential optimal stable arms, and unsuccessful players update their indices, the number of remaining players $N_2$, and the set of available arms $\mathcal{K}_2$. First, the procedure \textit{GALE-SHAPLEY} \citep{gale1962college} enables successful players to match with their potential optimal stable arms. Successful players then enter the ``Exploitation'' phase, and unsuccessful players check the available arms in order. Specifically, when $t=(n-1)K_2+m$ in Line \ref{UPT}, the player with index $n$, say player $j$, pulls the $m$-th arm in $\mathcal{K}_2$, say arm $k$, to check its availability. If player $j$ is rejected, she removes arm $k$ from the available arm set and decrements the number of remaining players. Lastly, unsuccessful players update their indices via the \textit{INDEX-ASSIGNMENT} procedure and start a new round.
\end{enumerate}
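
Reading off the loop bounds in Algorithm \ref{algorithm1}, Algorithm \ref{al_exploration}, and Algorithm \ref{COMM}, the length of one round with $N_2$ remaining players and $K_2$ available arms can be tallied as follows. This is only a bookkeeping sketch of the pseudocode as written, not a restatement of the regret bound; the symbols $T_{\mathrm{exp}}$, $T_{\mathrm{comm}}$, $T_{\mathrm{upd}}$, and $N_2'$ are introduced here purely for illustration.
\begin{align*}
T_{\mathrm{exp}} &= K_2K^2\lceil\log T\rceil, \\
T_{\mathrm{comm}} &= \underbrace{N_2}_{\text{repetitions}}\cdot\underbrace{N_2(N_2-1)}_{\text{ordered pairs}}\cdot\underbrace{K_2}_{\text{arms}}, \\
T_{\mathrm{upd}} &= \underbrace{N_2K_2}_{\text{availability check}}+\underbrace{N_2'}_{\text{index re-assignment}},
\end{align*}
where $N_2'\le N_2$ is the updated number of remaining players. For instance, with $N_2=K_2=3$ the communication sub-phase takes $3\cdot 6\cdot 3=54$ time steps, and in the availability check the player with Index $2$ probes the three available arms at $t=4,5,6$, since $t=(2-1)\cdot 3+m$ for $m=1,2,3$.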





In the ``Exploitation'' phase (Line \ref{exploitation}), every player keeps pulling her potential optimal stable arm until the end of the horizon.

\textbf{Rationale of the Communication.} When only a subset of players has completed preference learning and starts exploiting stable pairs within this subset, the stable pairs obtained may not align with those in the original stable matching. Previous research often relies on specific preference structures, such as uniqueness consistency and global ranking assumptions, to address this issue. Under a general preference structure, however, other players with higher priority may squeeze the settled players out of their current pairs, destabilizing the matching market. Communication conveys whether these high-priority players have finished their exploration. The communication process therefore helps players match with their optimal stable arms at low cost. 

\begin{example}[Example of Round Robin phase]
Consider a matching market with three arms denoted as $a_1, a_2, a_3$ and three players denoted as $p_1, p_2, p_3$. The individual preferences are outlined as follows:
\begin{eqnarray*}
    p_1: a_1 \succ a_2 \succ a_3, \quad a_1: p_1 \succ^a p_3 \succ^a p_2,\\
    p_2: a_3 \succ a_2 \succ a_1, \quad a_2: p_1 \succ^a p_3 \succ^a p_2,\\
    p_3: a_2 \succ a_3 \succ a_1, \quad a_3: p_3 \succ^a p_1 \succ^a p_2.
\end{eqnarray*}
If arms consistently provide accurate feedback once any player achieves a confident estimation (an event with high probability), the Round Robin phase may proceed as follows:
\begin{table} [h] 
\begin{center}   
\caption{An example of Round Robin Phase.} \label{table_example} 
\resizebox{\linewidth}{!}{
\begin{tabular}{c  c  c c c }   
\hline   
\textbf{round}& \textbf{remaining players}&\textbf{confident players}  &\textbf{successful players}&\textbf{available arms} \\   
\hline   \small{$1$} &\small{$\{p_1,p_2,p_3\}$}
&\small{$\{p_3\}$} & \small{$\emptyset$}&\small{$\{a_1,a_2,a_3\}$}\\
    \hline   \small{$2$} &\small{$\{p_1,p_2,p_3\}$}
     & \small{$\{p_1,p_3\}$} & \small{$\{p_1,p_3\}$}&\small{$\{a_1,a_2,a_3\}$}\\
\hline   \small{$3$}  &\small{$\{p_2\}$}
&\small{$\{p_2\}$} & \small{$\{p_2\}$}&\small{$\{a_3\}$}\\
\hline         
\end{tabular}   }
\end{center}  
\end{table}
\vspace{-0.2 in}

The round-robin phase comprises three rounds, resulting in two rounds of communication.

In the first round, only player $p_3$ attains a confident estimation. Consequently, during communication, player $p_1$, serving as the transmitter, first pulls the three arms in order twice. During the first three pulls, player $p_2$ serves as the receiver and sequentially pulls the three arms to receive information. During the next three pulls, player $p_3$ serves as the receiver, i.e., she pulls the three arms to obtain information. However, when player $p_3$ pulls arms $a_1$ and $a_2$, she is rejected, as these arms prefer player $p_1$. Consequently, no player achieves successful learning in this round.

In the second round, players $p_1$ and $p_3$ achieve confident estimations. During this round of communication, since players $p_1$ and $p_3$ have achieved confident estimations, they will not pull the same arm as the receiver when transmitting information. Consequently, receivers will not get rejected when receiving their information. Player $p_2$ will pull the same arm as the receiver. However, since players $p_1$ and $p_3$ are preferred over player $p_2$ on all arms, they will also not get rejected when receiving information from $p_2$. This implies that players $p_1$ and $p_3$ obtain successful learning and will proceed to the exploitation phase.

 In the last round, with only one player remaining, there is no communication.  Having achieved a confident estimation, player $p_2$ will leave the round-robin phase for the exploitation phase after this round.
\end{example}
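
To see which matching the players in the example above converge to, one can run player-proposing Gale-Shapley by hand on the listed preferences. This is a worked check added for illustration; it uses only the preference lists of the example.
\begin{align*}
\text{proposals: } & p_1 \to a_1,\quad p_2 \to a_3,\quad p_3 \to a_2,\\
\text{matching: } & \{(p_1,a_1),\,(p_2,a_3),\,(p_3,a_2)\}.
\end{align*}
All first-choice proposals go to distinct arms, so no proposal is rejected and the procedure stops after one step. Every player obtains her most preferred arm, hence this matching is player-optimal. It is also consistent with Table \ref{table_example}: once $p_1$ and $p_3$ occupy $a_1$ and $a_2$, the only available arm in round $3$ is $a_3$, the optimal stable arm of $p_2$.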


\vspace{-0.2 in}
\subsection{Regret Analysis}
\begin{theorem}\label{theorem1}
If every player runs Algorithm \ref{algorithm1}, and arms adopt $R$ rational strategies, then the optimal stable regret of any player $j$ can be upper bounded by: 
\begin{align*}
        \overline{R}_j(T)\hspace{-0.03 in}\le& \resizebox{0.9\linewidth}{!}{$N\hspace{-0.03 in}+\hspace{-0.03 in}K^3r \lceil\log T\rceil\hspace{-0.03 in}+\hspace{-0.03 in}Nr(KN(N\hspace{-0.03 in}-\hspace{-0.03 in}1)\hspace{-0.03 in}+\hspace{-0.03 in}N\hspace{-0.03 in}+\hspace{-0.03 in}K\hspace{-0.03 in}+\hspace{-0.03 in}1)\hspace{-0.03 in}+\hspace{-0.03 in}4KN\hspace{-0.03 in}\hspace{-0.03 in}+\hspace{-0.03 in}2$}\nonumber \\=&O(\frac{K\log T}{\Delta^2}).%u_{j\overline{m_j}}}
\end{align*}
Moreover, the arm-pessimal stable regret for any arm $k$ can also be upper bounded by:
\begin{align*}
        \underline{R}_k^a(T)\hspace{-0.03 in}\le& \resizebox{0.9\linewidth}{!}{$N\hspace{-0.03 in}+\hspace{-0.03 in}K^3r \lceil\log T\rceil\hspace{-0.03 in}+\hspace{-0.03 in}Nr(KN(N\hspace{-0.03 in}-\hspace{-0.03 in}1)\hspace{-0.03 in}+\hspace{-0.03 in}N\hspace{-0.03 in}+\hspace{-0.03 in}K\hspace{-0.03 in}+\hspace{-0.03 in}1)\hspace{-0.03 in}+\hspace{-0.03 in}4KN\hspace{-0.03 in}\hspace{-0.03 in}+\hspace{-0.03 in}2$}\nonumber \\=&O(\frac{K\log T}{\Delta^2}),%u_{j\overline{m_j}}}
\end{align*}
\vspace{-0.1 in}
where $r=\lceil\frac{4(c+2)^2}{K^2\Delta^2}\rceil$ and $c=\max\{2,\frac{\sqrt{R}D}{2}+1\}$.
\end{theorem}

 \begin{proof}[Proof Sketch] 
 We provide only a proof sketch for the player-optimal stable regret, with similar analysis applicable to derive the result for the arm regret. The complete proof is available in Appendix \ref{sec_proof}.

 Define the event $\mathcal{E}=\{\forall j \in \mathcal{N},  k \in \mathcal{K}, |\hat{u}_{jk}-u_{jk}|<2\sqrt{\frac{\log T}{N_{jk}}}\}$. We can decompose the regret depending on whether $\mathcal{E}$ and $\mathcal{E}^a$ hold, i.e., 
 \begin{align*}
     \overline{R}_j(T)\le&\mathbb{E}[\sum_{t=1}^T(u_{j\overline{m}_j}\hspace{-0.05 in}-(1-C_j(t))u_{jI_j(t)})|\mathcal{E}\cap\mathcal{E}^a]\\+&T\Pr[\neg \mathcal{E}]+T\Pr[\neg \mathcal{E}^a].
 \end{align*}
 Since the probabilities of $\neg\mathcal{E}$ and $\neg \mathcal{E}^a$ can each be upper bounded by a quantity of order $\frac{1}{T}$, the last two terms are bounded by a constant, and we only need to bound the regret conditional on $\mathcal{E}\cap\mathcal{E}^a$. By the design of the algorithm, the ``Index Assignment'' phase lasts for $N$ time steps and therefore causes at most $N$ regret. As for the other two phases, we can prove the following statements:
 \begin{itemize}
 \item Conditional on $\mathcal{E}$ and $\mathcal{E}^a$, with probability more than $1-\frac{2}{T}$, when a player achieves a confident estimation on the available arm set $\mathcal{K}_2$, the arms in $\mathcal{K}_2$ give accurate feedback.
 \item Conditional on $\mathcal{E}$ and $\mathcal{E}^a$, all players can achieve confident estimations after collecting $O(\log T)$ samples in the exploration.
     \item If arms in $\mathcal{K}_2$ give accurate feedback, conditional on $\mathcal{E}$ and $\mathcal{E}^a$, the successful players will pull their optimal stable arms in the exploitation phase.
 \end{itemize}
  Then, according to the design of the algorithm and these statements, we can also prove that, conditional on $\mathcal{E}\cap\mathcal{E}^a$, after no more than $O(\log T)$ time steps in the round-robin phase, all players enter the exploitation phase with their correct optimal stable arms with high probability. Combining these pieces yields the result.
 \end{proof}
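
The constant $r$ in Theorem \ref{theorem1} can be traced back to the confidence radius. The following is a sketch under two assumptions that are not restated in this section: $\Delta$ lower-bounds the gap $|u_{jk_1}-u_{jk_2}|$ between any two arms for every player, and both arms have been matched with player $j$ the same number of times $n$. Write $s=\sqrt{\log T/n}$. On the event $\mathcal{E}$, for $u_{jk_1}<u_{jk_2}$,
\begin{align*}
\text{UCB}_{jk_1} &= \hat{u}_{jk_1}+cs < u_{jk_1}+(c+2)s,\\
\text{LCB}_{jk_2} &= \hat{u}_{jk_2}-cs > u_{jk_2}-(c+2)s,
\end{align*}
so $\text{UCB}_{jk_1}<\text{LCB}_{jk_2}$ is guaranteed once $2(c+2)s\le\Delta$, i.e.,
\begin{equation*}
n \ge \frac{4(c+2)^2\log T}{\Delta^2}.
\end{equation*}
Each round of exploration lasts $K_2K^2\lceil\log T\rceil$ conflict-free steps spread evenly over the $K_2$ available arms, which gives $K^2\lceil\log T\rceil$ samples per arm per round. Hence $r=\lceil 4(c+2)^2/(K^2\Delta^2)\rceil$ rounds supply $rK^2\lceil\log T\rceil\ge 4(c+2)^2\log T/\Delta^2$ samples per arm, which suffices for a confident estimation.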
 
\section{Extending to Collaborative Arms}\label{EXTENSION}
%\section{Regret For Arms}
In this section, we examine scenarios in which arms implement more complex policies. We demonstrate that through collaboration between both parties, the assumption of learning difficulty can be eliminated. Furthermore, with support from the arms, players can achieve low regret. Specifically, we analyze the collaborative case with arbitrary learning difficulties and more complex arm strategies beyond rational strategies, in contrast to the previous scenario that assumed $\Delta^a > D\Delta$ and rational strategies for arms. The primary objective of this section is to serve as a preliminary exploration, laying the groundwork for a more comprehensive investigation into various arm strategies. Our aim is to understand how these diverse strategies employed by the arms can influence outcomes within the matching market. %We present our idea at a high level, with the intricate details available in the Appendix \ref{sec_extension} for reference.

\textbf{High-Level Idea.} The learning-difficulty assumption is needed because we must guarantee that, whenever a player has a confident estimation, the arms have learned their own preferences accurately and give precise feedback, which ensures effective communication. If arms are instead allowed to employ more complex strategies, such as using forced rejections to indicate whether they have completed preference learning, this issue can be addressed without the assumption.

\subsection{Players' Strategies}
Since the new algorithm (Algorithm \ref{algorithm3}) is similar to Algorithm \ref{algorithm1}, we will provide a brief overview, focusing primarily on the differences.

Players are initially assigned distinct indices. They then alternate between communication and exploration until every participant, including the arms, obtains a confident estimation. Specifically, players communicate through deliberate conflicts, gaining insights into others' learning processes. If there is a participant who hasn't achieved a confident estimation, all players return to exploration. Once all participants have confident estimations, players execute the \textit{GALE-SHAPLEY} algorithm to identify potential optimal stable arms and occupy them until the end.
\begin{algorithm}[tb]
\caption{Round Robin ETC with help from arms (for a player $j$)}\label{algorithm3}
    \begin{algorithmic}[1]
    \State Index $\gets$ \textit{INDEX-ASSIGNMENT($N,\mathcal{K}$)}
    \While{OPT$=\emptyset$} \Statex\text{\# when $j$ hasn't found her potential optimal stable arm yet}
    \State \resizebox{0.98\linewidth}{!}{(Success$,\hat{\boldsymbol{u}}_j,\boldsymbol{N}_j$) $\gets$ \textit{EXPLORATION}(Index, $K,K,\mathcal{K},\hat{\boldsymbol{u}}_j,\boldsymbol{N}_j$)}
    \State \resizebox{0.98\linewidth}{!}{Success $\gets$ \textit{COMM\_ARM}(\text{Index, Success}, $N,K,\mathcal{K}$)}\label{COMM_ARM}
    \If{Success$=1$}
    \State OPT $\gets$ \textit{GALE-SHAPLEY}%$(\text{Success},N_2,\mathcal{K}_2,\hat{\boldsymbol{u}}_j,\boldsymbol{N}_j)$
    \EndIf
    \EndWhile
    \State Pull OPT arm\label{al_exploit}
    \end{algorithmic}
    \end{algorithm}
    \begin{algorithm}[tb]
    \caption{\textit{COMM\_ARM} (for player $j$)}\label{al_comm_arm}
    \begin{algorithmic}[1]
        \Require \text{Index, Success}, $N,K,\mathcal{K}$
        \If{Success$=1$} Pull arm $1$
        \Else\quad Pull arm $2$
        \EndIf
        \For{$k=1,2,...,K,t=1,2,...,N$} Pull arm $k$
        \If{$t=$Index,$C_j=1$} Success$\gets 0$
        \EndIf
        \EndFor
        \\\Return Success
    \end{algorithmic}
\end{algorithm}


Recall the notions of upper confidence bound "UCB" and lower confidence bound "LCB" for player $j$: 
    \begin{equation}\label{UCB2}
        \text{UCB}_{jk}=\hat{u}_{jk}+c\sqrt{\frac{\log T}{N_{jk}}} \text{, LCB}_{jk}=\hat{u}_{jk}-c\sqrt{\frac{\log T}{N_{jk}}},
    \end{equation}where $\hat{u}_{jk}$ denotes the empirical mean and $N_{jk}$ denotes the number of times player $j$ has been matched with arm $k$.  In this section, let $c=2$. %$c=\max\{2D+1,\frac{\sqrt{R}D}{2}+1,\frac{\sqrt{R}}{2}+1,3\}$.
    We say that player $j$ achieves a confident estimation if, for every $k_1, k_2 \in \mathcal{K}$ with $k_1 \ne k_2$, either $\text{UCB}_{jk_1} <\text{LCB}_{jk_2}$ or $\text{LCB}_{jk_1} >\text{UCB}_{jk_2}$ holds.

    Regarding communication (Line \ref{COMM_ARM}, see Algorithm \ref{al_comm_arm} \textit{COMM\_ARM}), players first report their learning progress to arm $1$ and subsequently receive feedback from arm $1$, then arm $2$, and so on. Specifically, players with confident estimations pull arm $1$, while others pull a different arm. Players then take turns receiving information about others' learning progress. More precisely, each player pulls each arm $N$ times, and she achieves successful learning only if she is accepted at the time step corresponding to her index on every arm. Successful learning implies that every participant has a confident estimation. Thus, if a player achieves successful learning, she executes the \textit{GALE-SHAPLEY} algorithm to determine her potential optimal stable arm.
\vspace{-0.1 in}
\subsection{Arms' Strategies}
Arms continuously update their empirical means and matched times throughout the entire time horizon $T$ and assess their learning progress using confidence bounds. Additionally, arms act as communication intermediaries. Specifically, arms convey information to players by intentionally rejecting certain candidates.

Briefly speaking, arms will alternate between communication and selection (in Algorithm \ref{algorithm4}). The communication periods for arms coincide with those for players. While not engaged in communication, the arms will choose players myopically, selecting the most preferred candidates based on empirical means.

    \begin{algorithm}
        \caption{Arm Strategy (for an arm $k$)}\label{algorithm4}
        \begin{algorithmic}
        \State   1. Convey information during the communication period.
        \State 2. When not in communication, choose the most preferred candidate based on empirical means, and keep updating estimations.
        \end{algorithmic}
    \end{algorithm}
    
    Similarly, define the notions of upper confidence bound "UCB" and lower confidence bound "LCB" for arm $k$: 
    \begin{equation}\label{UCB3}
        \text{UCB}^a_{kj}=\hat{u}^a_{kj}+c\sqrt{\frac{\log T}{N^a_{kj}}} \text{, LCB}^a_{kj}=\hat{u}^a_{kj}-c\sqrt{\frac{\log T}{N_{kj}^a}},
    \end{equation}where $\hat{u}^a_{kj}$ denotes the empirical mean and $N^a_{kj}$ denotes the number of times arm $k$ has been matched with player $j$.  Let $c=2$. %$c=\max\{2D+1,\frac{\sqrt{R}D}{2}+1,\frac{\sqrt{R}}{2}+1,3\}$.
    We say that arm $k$ achieves a confident estimation if, for every two players $j_1, j_2$ with $j_1 \ne j_2$, either $\text{UCB}^a_{kj_1} <\text{LCB}^a_{kj_2}$ or $\text{LCB}^a_{kj_1} >\text{UCB}^a_{kj_2}$ holds.
 \begin{algorithm}
    \caption{\textit{COMM\_ARM} (for arm $k^*$)}\label{al_comm_arm_k}
    \begin{algorithmic}[1]
        \State Record the number of candidates as $N_p$ if $k^*=1$
        \For{$k=1,2,...,K$, $t=1,2,...,N$}
        \If{$k^*=1$}
 If $k^*$ has a confident estimation and $N_p=N$, accept the player with Index $t$
        \Else\quad If $k^*$ has a confident estimation, accept the player with Index $t$
        \EndIf\EndFor
    \end{algorithmic}
\end{algorithm}

   Regarding communication (see Algorithm \ref{al_comm_arm_k} \textit{COMM\_ARM}), arm $1$ first checks whether the number of invitations equals the number of players and then selects an arbitrary candidate. Over the next $KN$ time steps, each arm tells the players whether it has achieved a confident estimation, and arm $1$ additionally conveys whether all players have achieved confident estimations. Specifically, during time step $t$ in the first period of $N$ time steps, if arm $1$ has a confident estimation and received $N$ invitations during the previous check, it selects the candidate with index $t$. Similarly, arm $k$ chooses the candidate with index $t$ during time step $t$ in the $k$-th period of $N$ time steps only if it has a confident estimation.
   After the communication phase, if every participant has attained a confident estimation, every player has been accepted by each arm at the designated time step determined by her index. The players then initiate a collective matching process.
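
   The timing of this handshake can be made explicit. As a sketch (the step counter $s$ is introduced here for illustration and counts time steps from the start of a communication period), one communication period occupies
   \begin{equation*}
   \underbrace{1}_{\text{report step}}+\underbrace{KN}_{\text{feedback from arms }1,\dots,K}
   \end{equation*}
   time steps, and the player with Index $i$ reads the feedback of arm $k$ at step
   \begin{equation*}
   s = 1+(k-1)N+i,\qquad k=1,\dots,K .
   \end{equation*}
   For instance, with $N=2$ and $K=3$, the player with Index $2$ checks her rejection indicator at $s=3,5,7$, and she obtains successful learning only if she reported a confident estimation and is accepted at all three steps.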
   
   Note that in our algorithms, we consider that all arms have knowledge of the indices of all players. This can be easily adjusted by incorporating an index assignment procedure on each arm, which only requires $KN$ time steps in total.
\subsection{Regret Analysis}
The following theorem demonstrates the effectiveness of our algorithms. The detailed proof can be found in Appendix \ref{sec_extension}.
\begin{theorem}
    If all players run Algorithm \ref{algorithm3} and arms follow Algorithm \ref{algorithm4}, then the optimal stable regret of any player $j$ can be upper bounded by\footnote{A similar result for the arm-pessimal stable regret can be obtained analogously.}:
\begin{align*}
        \overline{R}_j(T)\le& N+K^3r \lceil\log T\rceil+r(1+KN)+4KN\nonumber \\=&O(\frac{K\log T}{\Delta_*^2}),%u_{j\overline{m_j}}}
\end{align*}
where $r=\lceil\frac{64}{K^2\Delta^2_*}\rceil$ and $\Delta_*=\min\{\Delta,\Delta^a\}$.
\end{theorem}
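
One way to read the leading terms of this bound, offered as an interpretation of the pseudocode rather than a substitute for the proof in Appendix \ref{sec_extension}, is the following:
\begin{align*}
N &: \text{index assignment ($N$ time steps)},\\
K^3r\lceil\log T\rceil &: r \text{ explorations of length } K\cdot K^2\lceil\log T\rceil,\\
r(1+KN) &: r \text{ communication periods of length } 1+KN.
\end{align*}
The remaining $4KN$ term is not interpreted here. The value of $r$ is in line with the confidence radius $c=2$: if estimates deviate from their means by less than $2\sqrt{\log T/n}$ after $n$ matches (an assumption mirroring the event $\mathcal{E}$ used for Theorem \ref{theorem1}), two confidence intervals separate once $2(c+2)\sqrt{\log T/n}\le\Delta_*$, i.e., $n\ge 64\log T/\Delta_*^2$, and $r$ rounds supply $rK^2\lceil\log T\rceil\ge 64\log T/\Delta_*^2$ samples per player-arm pair.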
\vspace{-0.2 in}
\section{Conclusion}
Inspired by the classical GS algorithm, we study convergence to the optimal stable matching for the proposing side in two-sided learning matching markets. 
Dropping many assumptions made in the matching bandits literature, such as observations and special preference structures, we study the more general case and consider strategies for both sides.
We model the passive side, namely the arm side, with a reasonable ``Rational Condition'', under which the arms' objective is to maximize their individual rewards. 
Then, on the proactive side, i.e., the player side, we introduce the Round-Robin ETC algorithm, which incorporates several techniques to tackle the challenges arising from unreliable feedback from arms and the absence of information and communication.
Through rigorous analysis, we demonstrate that the optimal matching for the proposing side can be achieved with high probability. 
Moreover, our algorithm achieves an $O(\log T)$ player-optimal stable regret, which matches the order of the state-of-the-art guarantee in the simpler one-sided learning setting. The simulations provided in Appendix \ref{sec_simulation} further validate our results. 
To summarize, our work contributes to the understanding of the convergence dynamics in two-sided learning matching markets under the described conditions.
Subsequent research directions may involve examining cases where arms adopt other more strategic and sophisticated policies.
% It remains to investigate the situations involving dishonest arms that could potentially misreport their preferences even after having acquired a confident understanding of their own preferences.  
Furthermore, exploring the dynamics of the strategic interactions between the player-side and the arm-side could serve as an intriguing avenue for further study.%Moreover, arms may also be regraded as adversary with part of rationality. How to design a robust algorithm with low regret can also be an interesting issue.

\bibliography{ref}
\bibliographystyle{named}
\input{appendix}
\end{document}
