\newpage
\appendix
\onecolumn


\section{\textit{GALE-SHAPLEY} in Competing Bandits}
\begin{algorithm}
\caption{\textit{GALE-SHAPLEY} (for a player $j$)}
    \begin{algorithmic}
    \Require $\text{Success},N,\mathcal{K},\hat{\boldsymbol{u}}_j,\boldsymbol{N}_j$
    \State $i\gets 1$, sort $k \in \mathcal{K}$, let $k_h$ be the arm with $h$-th highest empirical mean in $\mathcal{K}$
        \For{$t=1,2,...,N^2$}
        \State Pull arm $k_i$
        \If{$C_j=1$}
        \State $i\gets i+1$
        \EndIf
        \EndFor
        \If{Success$=1$}
        \Return $k_i$ \textbf{else} \Return $\emptyset$
        \EndIf
    \end{algorithmic}
\end{algorithm}
In the \textit{GALE-SHAPLEY} algorithm, every player proposes to her most preferred arm among those that have not yet rejected her.
\begin{lemma}\citep{gale1962college}
    Suppose player $j$ obtains successful learning. If every player sorts all arms accurately and every arm gives accurate feedback, then the output of \textit{GALE-SHAPLEY} equals player $j$'s optimal stable arm.
\end{lemma}
Note that if all the players who have already left, $\mathcal{N}\setminus\mathcal{N}_2$, occupy their optimal stable arms, then all the remaining players can also find their optimal stable arms through the \textit{GALE-SHAPLEY} algorithm.



\section{Regret Proof}\label{sec_proof}
Before we analyze the regret bound, we clarify some notations and introduce some lemmas.


Note that $\hat{u}_{jk}$ represents the empirical mean associated with arm $k$, as estimated by player $j$, with $N_{jk}$ indicating the number of times they have been matched. Similarly, $\hat{u}_{kj}^a$ and $N^a_{kj}$ are utilized to denote the empirical mean and matched times associated with player $j$, as estimated by arm $k$. It is important to highlight that players only update empirical means and matched times during the exploration period, whereas arms continuously update their empirical means and matched times throughout the entire time horizon $T$. 

Recall that $\mathcal{N}$ denotes the set of players within the entire market, while $\mathcal{N}_2$ signifies the subset of remaining players within the ``Round Robin'' phase. Similarly, $\mathcal{K}$ represents the set of arms within the entire market, while $\mathcal{K}_2$ denotes the available arms during the ``Round Robin'' phase. The utility gap for player $j$ is denoted as $\Delta_j = \min_{k_1,k_2 \in \mathcal{K}, k_1 \neq k_2} |u_{jk_1}-u_{jk_2}|$, and the utility gap for arm $k$ is denoted as $\Delta_k^a = \min_{j_1,j_2 \in \mathcal{N}, j_1 \neq j_2} |u^a_{kj_1}-u^a_{kj_2}|$.
The minimal gap of players is defined as $\Delta = \min_{j \in \mathcal{N}} \Delta_j$, and the minimal gap of arms is defined as $\Delta^a = \min_{k \in \mathcal{K}} \Delta_k^a$.
Furthermore,  $D$ represents a comparative ratio between two sides, ensuring that $D\Delta^a \geq \Delta_j$ for any $j$ in $\mathcal{N}$.
\begin{lemma}\label{lemma_0}
    (Corollary 5.1 in \citep{lattimore2020bandit}) Assume that $X_i-u$ are independent, $\sigma$-subgaussian random variables. Then for any $\epsilon\ge 0$,
\begin{equation*}
    \Pr[\hat{u}\ge u+\epsilon]\le \exp\left(-\frac{n\epsilon^2}{2\sigma^2}\right) \text{ and } \Pr[\hat{u}\le u-\epsilon]\le \exp\left(-\frac{n\epsilon^2}{2\sigma^2}\right),
\end{equation*}
where $\hat{u}=\frac{X_1+\cdots+X_n}{n}$.
\end{lemma}
\begin{lemma}\label{lemma_event}
Define the event
$\mathcal{E}=\{\forall j \in \mathcal{N},  k \in \mathcal{K}, |\hat{u}_{jk}-u_{jk}|<2\sqrt{\frac{\log T}{N_{jk}}}\}$,
and recall that $\mathcal{E}^a=\{\forall j \in \mathcal{N}, k \in \mathcal{K}, |\hat{u}^a_{kj}-u^a_{kj}|<2\sqrt{\frac{\log T}{N^a_{kj}}}\}$.
Then $\Pr[\neg\mathcal{E}]\le\frac{2KN}{T}$ and $\Pr[\neg\mathcal{E}^a]\le\frac{2KN}{T}$.
\end{lemma}
\begin{proof}
    The lemma follows directly from Lemma \ref{lemma_0} together with a union bound.
\end{proof}
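As a sketch of the calculation behind Lemma \ref{lemma_event}, fix a pair $(j,k)$ and a value $n$ of the count $N_{jk}$. We assume, as in the proof of Lemma \ref{lemma_2}, that rewards are $1$-subgaussian, and we take a union bound over the at most $T$ possible values of $N_{jk}$; this is our reading of the omitted argument rather than a step stated in the text. Lemma \ref{lemma_0} with $\sigma=1$ and $\epsilon=2\sqrt{\log T/n}$ gives
\begin{equation*}
    \Pr\left[|\hat{u}_{jk}-u_{jk}|\ge 2\sqrt{\frac{\log T}{n}}\right]\le 2\exp\left(-\frac{n}{2}\cdot\frac{4\log T}{n}\right)=\frac{2}{T^2}.
\end{equation*}
Summing over the $KN$ pairs and the at most $T$ values of $n$ yields $\Pr[\neg\mathcal{E}]\le KN\cdot T\cdot\frac{2}{T^2}=\frac{2KN}{T}$. The bound for $\mathcal{E}^a$ follows in the same way.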
\begin{lemma}\label{lemma_2}
Conditional on $\mathcal{E}$ and $\mathcal{E}^a$, with probability more than $1-\frac{2}{T}$, when a player achieves a confident estimation on the available arm set $\mathcal{K}_2$, the arms in $\mathcal{K}_2$ give accurate feedback.
\end{lemma}
Lemma \ref{lemma_2} shows that once players are confident in their estimates of the arm utilities, the arms give accurate feedback with high probability.
\begin{proof}
    Suppose player $j$ is the first player who achieves a confident estimation. By the design of the algorithm, the remaining arm set $\mathcal{K}_2$ then equals the whole arm set $\mathcal{K}$. Suppose arms $k_1, k_2 \in\mathcal{K}$ satisfy $u_{jk_1}-u_{jk_2}=\Delta_j$. Since player $j$ achieves a confident estimation, LCB$_{jk_1}>$UCB$_{jk_2}$ holds conditional on $\mathcal{E}$. During the exploration, all the available arms are explored evenly and without conflict. Recall that the rewards received by player $j$ are independent $1$-subgaussian random variables. Denote the rewards received after being matched with arm $k_1$ during the exploration by $X_1,X_2,\ldots,X_n$ and the rewards associated with arm $k_2$ by $Y_1,Y_2,\ldots,Y_n$, where $n=N_{jk_1}=N_{jk_2}$. Then $Z_i=X_i-Y_i$, $i=1,\ldots,n$, are independent $\sqrt{2}$-subgaussian random variables. By applying Lemma \ref{lemma_0}, with probability more than $1-\frac{2}{T}$ we obtain that:
    \begin{eqnarray*}
        \Delta_j>\frac{Z_1+\cdots+Z_n}{n}-2\sqrt{\frac{\log T}{n}}\ge\text{LCB}_{jk_1}-\text{UCB}_{jk_2}+\sqrt{R}D\sqrt{\frac{\log T}{n}}>D\sqrt{\frac{R\log T}{n}}.
    \end{eqnarray*}
    Note again that all the available arms are explored evenly and without conflict in the exploration. Thus the matched times for arms satisfy that $N_{kj'}^a\ge n\ge\frac{RD^2\log T}{\Delta_j^2}\ge\frac{R\log T}{(\Delta^a)^2}$ for every $k \in \mathcal{K}$ and every $j' \in\mathcal{N}$. According to the definition of $R$-rational condition, conditional on $\mathcal{E}^a$, arms give accurate feedback.%Conditional on $\mathcal{E}^a$, for any arm $k$ and any two players $j_1\ne j_2$ that $u_{kj_1}^a>u_{kj_2}^a$, the following holds:
    %$\hat{u}^a_{kj_1}>u^a_{kj_1}-2\sqrt{\frac{\log T}{n}}\ge\Delta^a+u^a_{kj_2}+2\sqrt{\frac{\log T}{n}}-4\sqrt{\frac{\log T}{n}}>\Delta^a+\hat{u}^a_{kj_2}-4\sqrt{\frac{\log T}{n}}\ge\hat{u}^a_{kj_2}$, which means the arms give accurate feedback.
\end{proof}
\begin{lemma}\label{lemma_3}
    Conditional on $\mathcal{E}$ and $\mathcal{E}^a$, a player $j$ will achieve a confident estimation on $\mathcal{K}_2$ after no more than $\lceil\frac{4(c+2)^2}{K^2\Delta^2}\rceil$ rounds in the ``Round Robin'' phase.
\end{lemma}
%Lemma \ref{lemma_3} shows that the time steps in the "Round Robin" phase can be upper bounded by a $O(\frac{D^2\log T}{\Delta^2})$ factor with high probability.
\begin{proof}
Note that after $\lceil\frac{4(c+2)^2}{K^2\Delta^2}\rceil$ rounds in the ``Round Robin'' phase, every available arm has been matched with player $j$ for at least $\frac{4(c+2)^2\log T}{\Delta^2}$ time steps during the exploration. Since players only update empirical means and matched times in the exploration, the matched times of all available arms are the same.
    For $k_1,k_2\in \mathcal{K}_2$ such that $u_{jk_1}>u_{jk_2}$, conditional on $\mathcal{E}$, we have that:
    \begin{eqnarray*}
        \text{LCB}_{jk_1}&=&\hat{u}_{jk_1}-c\sqrt{\frac{\log T}{N_{jk_1}}}>u_{jk_1}-(c+2)\sqrt{\frac{\log T}{N_{jk_1}}}\ge\Delta_j+u_{jk_2}-(c+2)\sqrt{\frac{\log T}{N_{jk_2}}}\\
        &>&\Delta_j+\hat{u}_{jk_2}-(c+4)\sqrt{\frac{\log T}{N_{jk_2}}}=\Delta_j+\hat{u}_{jk_2}+c\sqrt{\frac{\log T}{N_{jk_2}}}-(2c+4)\sqrt{\frac{\log T}{N_{jk_2}}}\\
        &=&\text{UCB}_{jk_2}+\Delta_j-(2c+4)\sqrt{\frac{\log T}{N_{jk_2}}}.
    \end{eqnarray*}
    The lemma follows from the fact that $N_{jk_2}\ge \frac{4(c+2)^2\log T}{\Delta^2}$ after $\lceil\frac{4(c+2)^2}{K^2\Delta^2}\rceil$ rounds in the ``Round Robin'' phase.
\end{proof}
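To make the last step of this proof explicit, here is a worked version of the substitution; it uses only $\Delta\le\Delta_j$ from the definition of $\Delta$ and assumes $c+2>0$. If $N_{jk_2}\ge\frac{4(c+2)^2\log T}{\Delta^2}$, then
\begin{equation*}
    (2c+4)\sqrt{\frac{\log T}{N_{jk_2}}}\le 2(c+2)\cdot\frac{\Delta}{2(c+2)}=\Delta\le\Delta_j,
\end{equation*}
so the chain of inequalities in the proof gives $\text{LCB}_{jk_1}>\text{UCB}_{jk_2}+\Delta_j-\Delta_j=\text{UCB}_{jk_2}$, which is the confidence condition for the pair $(k_1,k_2)$.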

We say a player $j'$ can influence player $j$ if there exist a sequence of distinct remaining players $j_0=j',j_1,\ldots,j_n=j$ and a sequence of available arms $k_1,\ldots,k_n$ such that $j_{i-1} \succ_{k_{i}} j_{i}$ for $i=1,2,\ldots,n$. Otherwise, we say player $j'$ cannot influence player $j$. The following lemma establishes the transitivity of the influence relation.
\begin{lemma}\label{lemma_infulence}
    If a player $j_0$ can influence  player $j'$, and player $j'$ can influence  $j$, then $j_0$ can also influence  player $j$.
\end{lemma}
\begin{proof}
    Since $j_0$ can influence player $j'$ and player $j'$ can influence player $j$, by definition there exist remaining players $j_0,j_1,j_2,\ldots,j_m=j',\ldots,j_n=j$ and available arms $k_1,\ldots,k_n$ that satisfy $j_{i-1}\succ_{k_i}j_i$ for $i=1,2,\ldots,n$ (obtained by merging the two sequences). If one of the following cases happens: (1) there exists $m_1<m$ such that $j=j_{m_1}$, (2) there exists $m_2>m$ such that $j_0=j_{m_2}$, or (3) there exist no $m_1<m$ and $m_2\ge m$ such that $j_{m_1}=j_{m_2}$, then the lemma follows immediately. Otherwise, suppose $j_{m_1}=j_{m_2}$ holds for some $m_1<m$ and $m_2\ge m$. Then the remaining players $j'_0=j_0,\ldots,j'_{m_1-1}=j_{m_1-1},j'_{m_1}=j_{m_2},j'_{m_1+1}=j_{m_2+1},\ldots,j'_{n'}=j_n$ and available arms $k'_1=k_1,\ldots,k'_{m_1}=k_{m_1},k'_{m_1+1}=k_{m_2+1},\ldots,k'_{n'}=k_n$ satisfy $j'_{i-1}\succ_{k'_i}j'_i$ for $i=1,2,\ldots,n'$. Repeating the above process, we can find a sequence of distinct remaining players $j_0=j^*_0,\ldots,j^*_{n^*}=j$ and a sequence of available arms $k^*_1,\ldots,k^*_{n^*}$ that satisfy $j_{i-1}^*\succ_{k^*_i}j^*_i$ for $i=1,2,\ldots,n^*$, which finishes the proof.
\end{proof}

\begin{lemma}\label{lemma_4}
During a communication, conditional on the arms giving accurate feedback, if a player $j$ never gets rejected when receiving, then for any $j'\ne j$ in $\mathcal{N}_2$, one of the following statements holds:

1)  player $j'$ achieves a confident estimation on the available arm set $\mathcal{K}_2$,

 2) player $j'$ cannot influence player $j$.
\end{lemma}
\begin{proof}
    We prove the lemma by contradiction. Suppose there exists a player $j'\ne j$ who does not achieve a confident estimation on the available arm set $\mathcal{K}_2$ and who can influence player $j$. Then there exist a sequence of distinct remaining players $j_0=j',j_1,\ldots,j_n=j$ and a sequence of available arms $k_1,\ldots,k_n$ such that $j_{i-1} \succ_{k_{i}} j_{i}$ for $i=1,2,\ldots,n$. Since player $j'$ fails to achieve a confident estimation and arms give accurate feedback, there exists a time step $t_1\le N_1K_1$ in the communication process at which $j_1$ gets rejected. Similarly, for each $i=1,\ldots,n$ there exists $t_i\le iN_1K_1$ such that player $j_i$ gets rejected at time step $t_i$. Since at most $N_2$ players remain, player $j$ gets rejected during the communication, which is a contradiction.
\end{proof}
According to the design of Algorithm \ref{algorithm1}, different players may match their potential optimal stable arms after different rounds in the ``Round Robin'' phase. GALE-SHAPLEY in \citep{gale1962college} is used to help players find their potential optimal stable arms. %We say a player $j'$ will never influence the optimal stable arm of player $j$ %
Note that if player $j'$ cannot influence player $j$,  the pulls of player $j'$ will not influence the output of the potential optimal arm (i.e. \textit{OPT} in Line \ref{updatestart}) for player $j$. %Denote the set of players who will be
\begin{lemma}\label{optarm}
    Conditional on $\mathcal{E}$ and $\mathcal{E}^a$, with probability at least $1-\frac{2}{T}$, when a player $j$ obtains successful learning, her potential optimal stable arm equals her optimal stable arm.
\end{lemma}
\begin{proof}
 Note that different players may obtain successful learning after different rounds in the ``Round Robin'' phase, and multiple players may obtain successful learning in the same round. We denote the $n$-th (in round order) set of players to obtain successful learning by $\mathcal{S}(n)$. Define the event: $\mathcal{E}^*=\{ \text{all successful players have correct estimations on } \mathcal{K}_2\}\cap\{\text{all arms give accurate feedback after a player achieves a confident estimation}\}$. We prove the statement ``conditional on $\mathcal{E}^*$, when a player $j$ obtains successful learning, her potential optimal stable arm equals her optimal stable arm'' by mathematical induction. Assuming that the statement holds for players in $\mathcal{S}=\cup_{i=1}^{n-1}\mathcal{S}(i)$, we prove it for players in $\mathcal{S}(n)$. Note that conditional on $\mathcal{E}^*$, all the players in $\mathcal{S}$ occupy their optimal stable arms. We now verify that no player $j'$ in $\mathcal{S}(m)$ (where $m=n+1,\ldots$) can influence the optimal stable arm of a player $j$ in $\mathcal{S}(n)$.
Suppose, for contradiction, that player $j'$ can influence the optimal stable arm of $j$. Since player $j'$ fails to obtain successful learning at round $n$, $j'$ either fails to achieve a confident estimation or gets rejected when receiving during the $n$-th round's communication. By combining Lemma \ref{lemma_4} and Lemma \ref{lemma_infulence}, there must exist a player $j_0$ (possibly equal to $j'$) who fails to achieve a confident estimation and can influence the optimal stable arm of $j$. Then player $j$ must have been rejected when receiving, which contradicts the definition of obtaining successful learning. Combining the above analyses proves the statement.
Now, based on Lemma \ref{lemma_2}, we only need to prove the correctness of every player's estimation on the available arm set $\mathcal{K}_2$ conditional on $\mathcal{E}$.
    Conditional on $\mathcal{E}$, for any $k_1,k_2 \in \mathcal{K}_2$ such that LCB$_{jk_1}>$UCB$_{jk_2}$, we have:
    \begin{equation*}
u_{jk_1}>\text{LCB}_{jk_1}>\text{UCB}_{jk_2}>u_{jk_2}.
    \end{equation*}
    Thus, the correctness of player $j$'s estimation is proved, and the original statement holds.%According to Lemma \ref{}, the players in \mathcal{S}_n will never influence the pull of players in $\mathcal{S}(i)$ enter the “Exploitation” phase with
\end{proof}
\begin{proof}[Proof of Theorem \ref{theorem1}]
Let $r=\lceil\frac{4(c+2)^2}{K^2\Delta^2}\rceil$. By %defining events $\mathcal{E}$ and $\mathcal{E}^a$, 
decomposing the player optimal stable regret and using the above lemmas, we obtain that: 
\begin{eqnarray}
    \overline{R}_j(T)&\le&\E[R_1+R_2+R_3|\mathcal{E}\cap\mathcal{E}^a]+T\Pr[\neg\mathcal{E}]+T\Pr[\neg\mathcal{E}^a]\label{EQr1r2r3}\\
    &\le&N+\E[R_2+R_3|\mathcal{E}\cap\mathcal{E}^a]+4KN\label{EQ2}
    \\
    &\le&N+K^3r \lceil\log T\rceil+r(KN^2(N-1)+N^2+NK+N)+4KN+2. \label{EQ3}
\end{eqnarray}

  In Eq.\ref{EQr1r2r3}, $R_1$ represents the regret in the ``Index Assignment'' phase, $R_2$ represents the regret in the ``Round Robin'' phase, and $R_3$ represents the regret in the ``Exploitation'' phase. Eq.\ref{EQ2} holds based on Lemma \ref{lemma_event} and the fact that the ``Index Assignment'' phase lasts for $N$ time steps. Combining Lemma \ref{lemma_2}, Lemma \ref{lemma_3} and Lemma \ref{optarm}, we conclude that, conditional on $\mathcal{E}$ and $\mathcal{E}^a$, with probability more than $1-\frac{2}{T}$, every player enters the ``Exploitation'' phase with her optimal stable arm after no more than $r$ rounds in the ``Round Robin'' phase. Thus, Eq.\ref{EQ3} holds.
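
  For concreteness, the additive constants can be traced as follows. This is a sketch that assumes the per-step regret is at most $1$, which is our reading of the $T\Pr[\cdot]$ terms and is not restated here. Lemma \ref{lemma_event} gives
\begin{equation*}
    T\Pr[\neg\mathcal{E}]+T\Pr[\neg\mathcal{E}^a]\le T\cdot\frac{2KN}{T}+T\cdot\frac{2KN}{T}=4KN,
\end{equation*}
  and the failure event of probability at most $\frac{2}{T}$ mentioned above contributes at most $T\cdot\frac{2}{T}=2$, which accounts for the final $+2$ in Eq.\ref{EQ3}.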

  As for arm-pessimal stable regret, we can easily conclude the result according to the fact that: if all players match with their optimal stable arms, then all arms match with their pessimal stable players.
\end{proof}
\section{Unknown Time Horizon}
 

In this section, we extend our results to the setting where the time horizon $T$ is unknown.

The doubling trick \citep{besson2018doubling,auer1995gambling} is a commonly used method to handle an unknown time horizon $T$ while preserving the $O(\log T)$ bound. We apply the doubling trick both to the total time horizon and to the exploration.%The corresponding theorem in $D$-PEA scenario is presented below as an example, pseudocode and other corresponding theorems can be seen in Appendix.

Using the exponential doubling trick, the whole time horizon $T$ is divided into several periods. In every period $r_1$, all players assume that the time horizon is $T_{r_1}=2^{2^{r_1}}$. Once they have acted for more than $T_{r_1}$ time steps in total, they update this assumption and enter the next period, i.e., they assume $T_{r_1+1}=2^{2^{r_1+1}}$. The doubling trick is also used in the exploration. Specifically, the first exploration lasts for $2K_2$ time steps, the second for $2\cdot2K_2=4K_2$ time steps, the third for $2\cdot4K_2=8K_2$ time steps, and so on.
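
As a rough check on the number of periods, the following sketch takes logarithms to base $2$ and assumes $T\ge 4$; both choices, and the symbol $r^*$, are our own and are not fixed by the text. Let $r^*$ be the smallest index with $2^{2^{r^*}}\ge T$, i.e. the index of the last period. Then
\begin{equation*}
    r^*=\lceil\log_2\log_2 T\rceil\le\log_2\log_2 T+1,\qquad \sum_{r_1=1}^{r^*}\log_2 T_{r_1}=\sum_{r_1=1}^{r^*}2^{r_1}=2^{r^*+1}-2<4\log_2 T,
\end{equation*}
where the last inequality uses $2^{r^*}<2\log_2 T$. Thus at most $\log_2\log_2 T+1$ periods occur, and the $\log T_{r_1}$ factors summed over all periods stay within a constant multiple of $\log T$.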

Moreover, we suppose that the arms are also unaware of the time horizon $T$, so they update their beliefs as well. Define the event $\mathcal{E}^a(r_1)=\{\forall j \in \mathcal{N}, k \in \mathcal{K}, |\hat{u}^a_{kj}-u^a_{kj}|<2\sqrt{\frac{2^{r_1}}{N^a_{kj}}}\}$.
 We say the arms adopt a modified $R$-rational method with unknown time horizon if, for every period $r_1$, conditional on $\mathcal{E}^a(r_1)$, after no more than $R\frac{ 2^{r_1}}{(\Delta^a)^2}$ samples for every player, the arms can estimate their utilities accurately.%we propose the modified $R$-sample-efficient method, i.e. in every period $r_1$, 
\begin{algorithm}
\caption{Round Robin ETC (for a player $j$ with unknown $T$)}\label{al_unknown}
    \begin{algorithmic}[1]
    \State Index $\gets$ \textit{INDEX-ASSIGNMENT($N,\mathcal{K}$)}
    \For{$r_1=1,2,...$}\label{period}
    \State OPT $\gets \emptyset$, $N_2\gets N, \mathcal{K}_2 \gets \mathcal{K},r_2\gets 1$ 
    \While{OPT$=\emptyset$}\text{\# when $j$ hasn't found her potential optimal stable arm yet}
    \For{$t=1,2,\ldots,2^{r_2}K_2$} \text{\# Exploration Sub-Phase}
    \State Pull the $m$-th arm in $\mathcal{K}_2$, where $m=(\text{Index}+t) \bmod K_2$, and update $\hat{u}_{jk}$, $N_{jk}$
    \EndFor
    \State $r_2\gets r_2+1$
    \If {for every $k_1\ne k_2\in \mathcal{K}_2$, $\text{UCB}_{jk_1}<\text{LCB}_{jk_2}$ or $\text{LCB}_{jk_1}>\text{UCB}_{jk_2}$} \label{UCBLCB}
    \State Success $\gets 1$ \text{\# the player achieves a confident estimation}
    \EndIf
    \Statex \# Communication Sub-Phase
    \State Success $\gets$ \textit{COMM}$(\text{Index},\text{Success},N_2,K_2,\mathcal{K}_2)$
     \Statex \# Update Sub-Phase
        \State OPT $\gets$ \textit{GALE-SHAPLEY}$(\text{Success},N_2,\mathcal{K}_2,\hat{\boldsymbol{u}}_j,\boldsymbol{N}_j)$
    \If{Success$=1$} \textbf{Break while}
    \EndIf
    \For{$t=1,\ldots,N_2K_2$}
    \If {$t=(\text{Index}-1)K_2+m$} 
    \State Pull arm $k$ that is $m$-th arm in $\mathcal{K}_2$
    \If {$C_j=1$} $\mathcal{K}_1\gets \mathcal{K}_1\setminus \{k\}, N_1\gets N_1-1$
    \EndIf
    \EndIf
    \EndFor
    \State $N_2\gets N_1, \mathcal{K}_2\gets\mathcal{K}_1$ 
    \State Index $\gets$ \textit{INDEX-ASSIGNMENT($N_2,\mathcal{K}_2$)}\label{updateend2}
    \EndWhile
    %\State $(\text{Index},N_2,\mathcal{K}_2) \gets$ \textit{UPDATE}$(\text{Index},N_2,\mathcal{K}_2)$
    \State Pull OPT arm
    \EndFor
    \end{algorithmic}
\end{algorithm}
%\section{Regret For Arms}
\begin{theorem}\label{al_doub}
     If every player runs Algorithm \ref{al_unknown}, and arms adopt modified strategies that satisfy the $R$-rational condition, then the optimal stable regret of any player $j$ can be upper bounded by:
     \begin{equation}
         \overline{R}_j(T)\hspace{-0.02 in}\le N+\frac{32K(c\hspace{-0.05 in}+\hspace{-0.05 in}2)^2\hspace{-0.02 in}\log T}{\Delta^2}+rN\log(\frac{32K(c\hspace{-0.05 in}+\hspace{-0.05 in}2)^2\hspace{-0.02 in}\log T}{\Delta^2})(KN(N\hspace{-0.02 in}-\hspace{-0.02 in} 1)+N+K+1)+(4KN+2)r,
     \end{equation}
     where $r=\log \log T +1$.
\end{theorem}
\begin{theorem}\label{al_dou_exploration}
    If every player runs the modification of Algorithm \ref{algorithm1} that applies the doubling trick to the exploration, and arms adopt an $R$-rational method, then the optimal stable regret of any player $j$ can be upper bounded by\footnote{A similar result for the arm-pessimal stable regret can be obtained in the same way.}:
    \begin{equation}
        \overline{R}_j(T)\le N+\frac{8K(c\hspace{-0.05 in}+\hspace{-0.05 in}2)^2\log T}{\Delta^2}+N\log(\frac{16K(c\hspace{-0.05 in}+\hspace{-0.05 in}2)^2\log T}{\Delta^2})(KN(N-1)+N+K+1)+4KN+2.
    \end{equation}
\end{theorem}
\begin{proof}
Since the doubling trick is only applied to the exploration, after $r$ rounds of exploration every available arm has been explored for $2^{r+1}-2$ time steps. By an analysis similar to that of Lemma \ref{lemma_3}, we can conclude that, conditional on $\mathcal{E}$, after no more than $\frac{8K(c+2)^2\log T}{\Delta^2}$ time steps of exploration, every player achieves a confident estimation on the available arm set $\mathcal{K}_2$. The theorem then follows from the proof of Theorem \ref{theorem1}.
\end{proof}
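The factor $8$ in the exploration term can be traced with a short calculation. This is a sketch, and $n^*$ is our shorthand (not notation from the text) for the per-arm sample requirement $\frac{4(c+2)^2\log T}{\Delta^2}$ of Lemma \ref{lemma_3}. Let $r$ be the first round with $2^{r+1}-2\ge n^*$. Then $2^{r}-2<n^*$, and hence the number of exploration pulls of each arm up to that round satisfies
\begin{equation*}
    2^{r+1}-2=2(2^{r}-2)+2<2n^*+2.
\end{equation*}
The total exploration length over the at most $K$ available arms is therefore below $K(2n^*+2)=\frac{8K(c+2)^2\log T}{\Delta^2}+2K$, which agrees with the leading term of the bound up to a lower-order additive term.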
\begin{proof}[Proof of Theorem \ref{al_doub}]
By the design of Algorithm \ref{al_unknown} and Theorem \ref{al_dou_exploration}, the conclusion follows by summing the regret over all periods.
\end{proof}
\begin{remark}
    Similar results for arm regret can be easily obtained due to the fact that:  if all players match with their optimal stable arms, then all arms match with their pessimal stable players.
\end{remark}
\section{Omitted Proofs in Section \ref{EXTENSION}}\label{sec_extension}

In this section, we provide a regret analysis for the collaborative case. Before we prove the main theorem, we provide some lemmas that will be useful.
\begin{lemma}\label{lemma_success}
    If player $j$ obtains successful learning, then all participants achieve confident estimations and all players obtain successful learning.
\end{lemma}
\begin{proof}
    According to the design of the algorithm, if player $j$ achieves successful learning, she succeeds at the time step corresponding to her index on each arm during communication. Note that arm $1$ chooses player $j$ at that time step only when all players attain confident estimations and choose arm $1$ during the previous check, and arm $1$ itself achieves a confident estimation. Other arms select player $j$ at the time step corresponding to her index only when they achieve confident estimations. As a result, all participants attain confident estimations. Furthermore, it follows that all players achieve successful learning.
\end{proof}
\begin{lemma}\label{lemma_r}
    Conditional on $\mathcal{E}$ and $\mathcal{E}^a$, a participant will achieve a confident estimation after no more than $\lceil\frac{64}{K^2\Delta_*^2}\rceil$  rounds of exploration, where $\Delta_*=\min\{\Delta,\Delta^a\}$.
\end{lemma}
\begin{proof}
    After $\lceil\frac{64}{K^2\Delta_*^2}\rceil$ rounds of exploration, every arm has been matched with each player for at least $\frac{64}{\Delta_*^2}\log T$ time steps, i.e. $N_{jk}\ge \frac{64\log T}{\Delta_*^2}$ and $N_{kj}^a\ge \frac{64\log T}{\Delta_*^2}$ hold for every player $j$ and every arm $k$. For $k_1,k_2\in \mathcal{K}$ such that $u_{jk_1}>u_{jk_2}$, conditional on $\mathcal{E}$, we have that:
    \begin{eqnarray*}
        \text{LCB}_{jk_1}&=&\hat{u}_{jk_1}-2\sqrt{\frac{\log T}{N_{jk_1}}}>u_{jk_1}-4\sqrt{\frac{\log T}{N_{jk_1}}}\ge\Delta_j+u_{jk_2}-4\sqrt{\frac{\log T}{N_{jk_1}}}\\
        &>&\Delta_j+\hat{u}_{jk_2}-2\sqrt{\frac{\log T}{N_{jk_2}}}-4\sqrt{\frac{\log T}{N_{jk_1}}}=\Delta_j+\hat{u}_{jk_2}+2\sqrt{\frac{\log T}{N_{jk_2}}}-4\sqrt{\frac{\log T}{N_{jk_2}}}-4\sqrt{\frac{\log T}{N_{jk_1}}}\\
        &=&\text{UCB}_{jk_2}+\Delta_j-4\sqrt{\frac{\log T}{N_{jk_2}}}-4\sqrt{\frac{\log T}{N_{jk_1}}}
        \\&\ge&\text{UCB}_{jk_2}.
    \end{eqnarray*}
    Similarly, we can prove that for $j_1,j_2\in \mathcal{N}$ such that $u_{kj_1}^a>u_{kj_2}^a$, conditional on $\mathcal{E}^a$, $\text{LCB}^a_{kj_1} >\text{UCB}^a_{kj_2}$.
\end{proof}
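The constant $64$ is what makes the final inequality of this proof go through, as the following worked substitution shows (it uses only $\Delta_*\le\Delta\le\Delta_j$). If $N_{jk_1},N_{jk_2}\ge\frac{64\log T}{\Delta_*^2}$, then
\begin{equation*}
    4\sqrt{\frac{\log T}{N_{jk_1}}}+4\sqrt{\frac{\log T}{N_{jk_2}}}\le 4\cdot\frac{\Delta_*}{8}+4\cdot\frac{\Delta_*}{8}=\Delta_*\le\Delta_j,
\end{equation*}
so the two confidence terms together are at most $\Delta_j$ and $\text{LCB}_{jk_1}>\text{UCB}_{jk_2}$ follows.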
\begin{lemma}\label{lemma_correct}
    If a participant achieves a confident estimation, conditional on $\mathcal{E}$ and $\mathcal{E}^a$, the estimation of the participant is correct.
\end{lemma}
\begin{proof}
     Conditional on $\mathcal{E}$, for any $k_1,k_2 \in \mathcal{K}_2$ such that LCB$_{jk_1}>$UCB$_{jk_2}$, we have:
    \begin{equation*}
u_{jk_1}>\text{LCB}_{jk_1}>\text{UCB}_{jk_2}>u_{jk_2}.
    \end{equation*}
    Thus, the correctness of player $j$'s estimation is proved. The correctness of arms' estimations can be similarly obtained.
\end{proof}

Combining Lemma \ref{lemma_success} and Lemma \ref{lemma_correct} and based on the property of \textit{GALE-SHAPLEY} algorithm, we can conclude that, conditional on $\mathcal{E}$ and $\mathcal{E}^a$, once a player obtains successful learning, she will exploit her optimal stable arm till the end. Together with these lemmas and analysis, we now move to our main theorem.

\begin{proof}
    By decomposing the player optimal stable regret and using the above lemmas, we obtain 
\begin{eqnarray}
    \overline{R}_j(T)&\le&\E[R_1+R_2+R_3|\mathcal{E}\cap\mathcal{E}^a]+T\Pr[\neg\mathcal{E}]+T\Pr[\neg\mathcal{E}^a]\label{EQr1r2r31}\\
    &\le&N+\E[R_2+R_3|\mathcal{E}\cap\mathcal{E}^a]+4KN\label{EQ21}
    \\
    &\le&N+K^3r \lceil\log T\rceil+r(1+KN)+4KN. \label{EQ31}
\end{eqnarray}

  In Eq.\ref{EQr1r2r31}, $R_1$ represents the regret in the ``Index Assignment'' procedure, $R_2$ represents the regret caused by the exploration and communication, and $R_3$ represents the regret caused by exploitation. Eq.\ref{EQ21} holds based on Lemma \ref{lemma_event} and the fact that the ``Index Assignment'' phase lasts for $N$ time steps. Combining Lemma \ref{lemma_success}, Lemma \ref{lemma_r}, and Lemma \ref{lemma_correct}, we conclude that, conditional on $\mathcal{E}$ and $\mathcal{E}^a$, every player exploits her optimal stable arm after no more than $r$ rounds of exploration and communication. Thus, Eq.\ref{EQ31} holds.

  As for arm-pessimal stable regret, we can easily conclude the result according to the fact that: if all players match with their optimal stable arms, then all arms match with their pessimal stable players.
\end{proof}
\section{Simulation}\label{sec_simulation}
In this section, we provide numerical results to show the performance of our algorithm. %For all experiments, the ranks of arms and players are all determined randomly, and we assume that the arms take actions according to empirical means.
We estimate the average player-optimal stable regret and standard deviations of regret over $30$ independent runs.  %Moreover, in order to show the dependence of market size in our algorithms, we compute the average regret for players. 
\begin{figure}[h]
\centering  
\subfigure[Case 1]{
\label{Fig.sub.1}
\includegraphics[width=0.3\textwidth]{CASE 1.pdf}}
\subfigure[Case 2]{
\label{Fig.sub.2}
\includegraphics[width=0.3\textwidth]{CASE 2.pdf}}
\subfigure[Case 3]{
\includegraphics[width=0.285\textwidth]{simulation_4.pdf}}
\vspace{-0.05in}
%\caption{Comparison between SUBMARINE and baseline.}
\label{Fig.main}
\end{figure}

\textbf{Baselines.}
\begin{itemize}
   \item \textit{PCA-UCB} is a conflict-avoiding algorithm with a random delay parameter $\lambda$. This algorithm extends CA-UCB, which only achieves an $O(\log ^2 T)$ bound on the player-pessimal regret, even in the one-sided setting. We set $\lambda=0.9$ based on the simulations in \citep{pokharel2023converging}. Since \citet{pokharel2023converging} do not reveal detailed strategies for the arm side, in our simulations we assume that the arms choose the candidates with the highest UCB.
        \item \textit{CA-ETC} is a multi-epoch ETC-type algorithm with a theoretical regret guarantee. As in the simulations of \citep{pagare2023two}, we choose $\gamma$, which determines the horizon length, to be $0.25$. In \citep{pagare2023two}, the epoch length $T_0$ is chosen based on $\Delta^a$ and $\Delta$. The authors do not disclose the specific details of how $T_0$ is chosen for simulations, but they emphasize that $T_0$ should be optimistically high. Thus, we set $T_0$ to be $50000$. Moreover, CA-ETC requires arms to adopt specific strategies that are symmetric to those of the players.
\end{itemize}
We investigate three scenarios in the context of a multi-armed bandit problem involving five players and five arms in two instances, and four players and four arms in one instance. In the former two cases, the minimum gaps between players and arms are set to $0.2$, while in the latter case, the minimum gap is $0.25$. The preferences for these scenarios are described as follows:

(1) Case $1$:
\begin{eqnarray*}
    p_1: a_4 \succ a_1 \succ a_2 \succ a_3 \succ a_5, \quad a_1: p_1 \succ^a p_4 \succ^a p_2 \succ^a p_3 \succ^a p_5,\\
    p_2: a_5 \succ a_2 \succ a_1 \succ a_3 \succ a_4, \quad a_2: p_2 \succ^a p_5 \succ^a p_3 \succ^a p_1 \succ^a p_4,\\
    p_3: a_3 \succ a_4 \succ a_2 \succ a_5 \succ a_1, \quad a_3: p_2 \succ^a p_1 \succ^a p_3 \succ^a p_5 \succ^a p_4,\\
    p_4: a_2 \succ a_1 \succ a_3 \succ a_5 \succ a_4, \quad a_4: p_3 \succ^a p_5 \succ^a p_2 \succ^a p_4 \succ^a p_1,\\
    p_5: a_1 \succ a_3 \succ a_4 \succ a_2 \succ a_5, \quad a_5: p_1 \succ^a p_3 \succ^a p_2 \succ^a p_4 \succ^a p_5. 
\end{eqnarray*}

(2) Case $2$:
\begin{eqnarray*}
    p_1: a_4 \succ a_1 \succ a_5 \succ a_2 \succ a_3, \quad a_1: p_3 \succ^a p_1 \succ^a p_5 \succ^a p_2 \succ^a p_4,\\
    p_2: a_5 \succ a_1 \succ a_2 \succ a_4 \succ a_3, \quad a_2: p_5 \succ^a p_2 \succ^a p_1 \succ^a p_4 \succ^a p_3,\\
    p_3: a_2 \succ a_5 \succ a_3 \succ a_1 \succ a_4, \quad a_3: p_3 \succ^a p_1 \succ^a p_2 \succ^a p_5 \succ^a p_4,\\
    p_4: a_5 \succ a_2 \succ a_1 \succ a_3 \succ a_4, \quad a_4: p_1 \succ^a p_2 \succ^a p_5 \succ^a p_4 \succ^a p_3,\\
    p_5: a_3 \succ a_5 \succ a_2 \succ a_4 \succ a_1, \quad a_5: p_1 \succ^a p_4 \succ^a p_5 \succ^a p_3 \succ^a p_2. 
\end{eqnarray*}

(3) Case $3$:
\begin{eqnarray*}
    p_1: a_2 \succ a_1  \succ a_4 \succ a_3, \quad a_1: p_2 \succ^a p_1 \succ^a p_4 \succ^a p_3,\\
    p_2: a_4 \succ a_1 \succ a_2  \succ a_3, \quad a_2: p_4 \succ^a p_2 \succ^a p_1 \succ^a p_3,\\
    p_3: a_3  \succ a_2 \succ a_1 \succ a_4, \quad a_3:  p_1 \succ^a p_3 \succ^a p_4 \succ^a p_2,\\
    p_4:  a_1 \succ a_2 \succ a_3 \succ a_4, \quad a_4:  p_2 \succ^a p_4 \succ^a p_3 \succ^a p_1.
\end{eqnarray*}
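As a sanity check on these instances, the player-optimal stable matchings can be computed by hand by running player-proposing deferred acceptance on the true preferences listed above. This is our own calculation from the preference lists, not an output of the simulations. In Cases $1$ and $3$ all players have distinct first choices, so every player is matched to her most preferred arm. In Case $2$, players $p_2$ and $p_4$ both propose to $a_5$ first; since $a_5$ ranks $p_4$ above $p_2$, player $p_2$ is rejected and proposes to her second choice $a_1$, which is unoccupied. The resulting matchings are:
\begin{eqnarray*}
    \text{Case } 1:&& (p_1,a_4),\ (p_2,a_5),\ (p_3,a_3),\ (p_4,a_2),\ (p_5,a_1),\\
    \text{Case } 2:&& (p_1,a_4),\ (p_2,a_1),\ (p_3,a_2),\ (p_4,a_5),\ (p_5,a_3),\\
    \text{Case } 3:&& (p_1,a_2),\ (p_2,a_4),\ (p_3,a_3),\ (p_4,a_1).
\end{eqnarray*}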
 From the figures, we can conclude that Round-Robin ETC outperforms the baselines in all cases. Additionally, the results of Round-Robin ETC and CA-ETC exhibit greater stability than those of PCA-UCB.

 The reasons why PCA-UCB performs unstably and fails to obtain sublinear regret may be as follows:

Firstly, in different runs of simulations, PCA-UCB may converge to different stable matchings instead of consistently converging to the player-optimal stable matching.
This variability in convergence could be a significant challenge, as the player-optimal stable matching is more desirable for players. Furthermore, \citet{liu2021bandit} illustrate an example where even centralized UCB cannot achieve sublinear player-optimal regret.

Secondly, when applying PCA-UCB, players adopt a UCB-type method to choose arms, resulting in insufficient samples for arms to learn their preferences. Consequently, arms may provide inaccurate feedback in the two-sided learning setting, potentially leading to unstable matching or an extended time to convergence.

Regarding CA-ETC, it is important to note that the players persist in exploring arms even after each participant has acquired knowledge of her own preferences. Consequently, regret continues to accumulate over time. The regret curve exhibits a stair-like pattern, reflecting the periodic increments in regret.

Furthermore, the depicted data indicate a consistent decrease in the regret of CA-ETC and Round-Robin ETC as the numbers of players and arms decrease and as the minimal gap increases. In contrast, the regret observed for PCA-UCB increases. This trend may be attributed to the factors outlined previously: firstly, the tendency to converge towards lower-quality stable matchings as opposed to player-optimal stable matchings; and secondly, the failure to converge and the persistent selection of lower-quality arms. Notably, these issues are linked to the preference structure and detailed utilities rather than to the scale of the market or the minimal gap.

