\section{ASYNCHRONOUS ALGORITHMS FOR FEDERATED MAB}
In this section, we propose the first asynchronous algorithm for the pure exploration problem of federated MAB. 
As mentioned in Section \ref{sec:intro}, a key challenge in conducting pure exploration via asynchronous communication is the absence of dedicated synchronous communication rounds in which the server can assign arms for each agent to explore based on the latest observations. Moreover, there is no guarantee on when, or whether, an agent will become active again to execute the exploration and report its observations back. This severely hinders the applicability of all existing distributed/federated pure exploration algorithms, whose exploration strategies are based on experimental design \citep{Hillel2013DistributedEI,Du2021CollaborativePE,Reda2022NearOptimalCL}. To address this challenge, we adopt a fully adaptive exploration strategy in which each agent separately and asynchronously decides which arm to pull, based on the statistics received from the server in its latest communication. We name the resulting algorithm Federated Asynchronous MAB Pure Exploration (\texttt{FAMABPE}); its description is given in Algorithm \ref{alg3}.


\textbf{FAMABPE algorithm } As illustrated in lines 2--7, Algorithm \ref{alg3} begins with an initialization phase of $K$ rounds, in which the $K$ arms are pulled sequentially; the agents and the server then initialize their statistics accordingly. In each round $t\ge K+1$, an agent $m_t$ becomes active and computes its empirical best arm $i_{m_t,t}$ and the most ambiguous arm $j_{m_t,t}$, where
\begin{align}\label{alg1eq1}
\begin{split}
   i_{m_t,t} =& \arg\max_{k\in\A} \hat{\mu}_{m_t,t}(k),\\  j_{m_t,t} =& \arg\max_{k \in \A\setminus \{i_{m_t,t}\}} \hat{\Delta}_{m_t,t}(k,i_{m_t,t})\\& + \alpha^{M}_{m_t,t}(i_{m_t,t},k),
\end{split}
\end{align}
based on which it selects the most informative arm $k_{m_t,t} = \arg\max_{k\in\{i_{m_t,t},j_{m_t,t}\}} \alpha^M_{m_t,t}(k)$ to pull in round $t$. Here $\hat{\mu}_{m_t,t}(k)$ denotes the agent's reward estimator of arm $k$, $\hat{\Delta}_{m_t,t}(i,j) = \hat{\mu}_{m_t,t}(i) - \hat{\mu}_{m_t,t}(j)$ denotes the agent's estimated reward gap between arms $i$ and $j$, and $\alpha^M_{m_t,t}(i,j) = \alpha^M_{m_t,t}(i) + \alpha^M_{m_t,t}(j)$ denotes the agent's exploration bonus for the pair $(i,j)$ (the definition of $\alpha^M_{m_t,t}(k)$ is given in Theorem \ref{theorem1}). Intuitively, pulling $k_{m_t,t}$ yields the largest decrease in $\alpha^M_{m_t,t}(i_{m_t,t},j_{m_t,t})$ and thus helps reduce the sample complexity. After observing the reward $r_{m_{t},t}$ of arm $k_{m_{t},t}$, agent $m_t$ checks the communication event in line 11. If the event is true, agent $m_t$ uploads its local reward sum $S^{loc}_{m_t,t}(k)$ and local observation count $T^{loc}_{m_t,t}(k)$, $\forall k\in\A$, to the server. The server then updates its statistics and estimates as
\begin{align}\label{alg1eq2}
\begin{split}
       &\hat{\mu}_{ser,t}(k) = \frac{\hat{\mu}_{ser,t-1}(k)T_{ser,t-1}(k) + S_{m_t,t}^{loc}(k)}{T_{ser,t-1}(k) + T^{loc}_{m_t,t}(k)},\\ & T_{ser,t}(k) = T_{ser,t-1}(k) + T_{m_t,t}^{loc}(k),\ \forall k\in \A
\end{split}
\end{align}
and
\begin{align}\label{alg1eq3}
    \begin{split}
       & i_{ser,t} = \arg\max_{k\in\A} \hat{\mu}_{ser,t}(k),\\&  j_{ser,t} = \arg\max_{k \in \A\setminus\{i_{ser,t}\}} \hat{\Delta}_{ser,t}(k,i_{ser,t}) + \alpha^M_{ser,t}(i_{ser,t},k),
       \\ &  B(t) =  \hat{\Delta}_{ser,t}(j_{ser,t},i_{ser,t}) + \alpha^M_{ser,t}(i_{ser,t},j_{ser,t}),
    \end{split}
\end{align}
where $\hat{\mu}_{ser,t}(k)$ denotes the server's reward estimator of arm $k$, $T_{ser,t}(k)$ denotes the server's observation count of arm $k$, $\hat{\Delta}_{ser,t}(i,j) = \hat{\mu}_{ser,t}(i) - \hat{\mu}_{ser,t}(j)$ denotes the server's estimated reward gap, and $\alpha^M_{ser,t}(i,j) = \alpha^M_{ser,t}(i) + \alpha^M_{ser,t}(j)$ denotes the server's exploration bonus for the pair $(i,j)$ (the definition of $\alpha^M_{ser,t}(k)$ is given in Theorem \ref{theorem1}). If the breaking index $B(t) \le \epsilon$, the server sets its estimated best arm $\hat{k}^* = i_{ser,t}$ and terminates the algorithm (which implies $\tau = t$). Otherwise, agent $m_t$ downloads $\hat{\mu}_{ser,t}(k)$ and $T_{ser,t}(k)$, $\forall k\in\A$, from the server and updates its local statistics as shown in lines 18--19. Further details are given in the pseudo-code of Algorithm \ref{alg3}.
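
\paragraph{Illustrative example of the server update }
As a purely illustrative sketch of the update in (\ref{alg1eq2}), with hypothetical numbers that are not taken from any experiment, suppose that for some arm $k$ the server holds $\hat{\mu}_{ser,t-1}(k) = 0.5$ and $T_{ser,t-1}(k) = 10$, and that agent $m_t$ uploads $S^{loc}_{m_t,t}(k) = 3.5$ and $T^{loc}_{m_t,t}(k) = 5$. Then
\begin{align}
\nonumber
\begin{split}
    &\hat{\mu}_{ser,t}(k) = \frac{0.5\cdot 10 + 3.5}{10 + 5} = \frac{8.5}{15} \approx 0.567,\\
    & T_{ser,t}(k) = 10 + 5 = 15.
\end{split}
\end{align}
Since $\hat{\mu}_{ser,t-1}(k)T_{ser,t-1}(k)$ is the reward sum of the observations already held by the server, the update is simply the pooled empirical mean over all $15$ observations of arm $k$.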

% \begin{remark}\label{remark2}
\paragraph{Low switching cost }
Unlike previous distributed/federated pure exploration algorithms \citep{Hillel2013DistributedEI,Du2021CollaborativePE,Reda2022NearOptimalCL}, \texttt{FAMABPE} enjoys a low switching cost (i.e., $\C(\tau)/2$). The switching cost is defined as the number of times an agent $m\in\M$ updates $k_{m,t}$ \citep{AbbasiYadkori2011ImprovedAF,He2022ASA,li2023learning}. Suppose $t_1$ and $t_2$ are two consecutive communication rounds of agent $m$. Then $\hat{\mu}_{m,t}(k)$ and $T_{m,t}(k)$, $\forall k\in\A$, remain unchanged from round $t_{1}+1$ to $t_2$ (lines 20--22 in Algorithm \ref{alg3}), which implies that $k_{m,t}$ also remains unchanged. Hence, the switching cost of \texttt{FAMABPE} equals the total number of communication rounds.
% \end{remark}

% \begin{remark}\label{whycomm}
\paragraph{Design of communication event }
The event-triggered communication strategy of \texttt{FAMABPE} simultaneously controls the amount of local data that each agent $m\in \M$ has not yet uploaded, i.e., $\sum_{k=1}^KT^{loc}_{m,t}(k)$, and the size of the exploration bonuses. Note that in our setting, neither the agents nor the server knows the total number of observations in the system, i.e., the time index $t$. Therefore, we use $\sum_{k=1}^KT_{m_t,t}(k)$ and $\sum_{k=1}^K T_{ser,t}(k)$ to construct the exploration bonuses of the agents and the server, respectively. This requires $\sum_{k=1}^KT_{m_t,t}(k)$ and $\sum_{k=1}^K T_{ser,t}(k)$ to be in a desired proportion to $t$ (which differs from \cite{Li2021AsynchronousUC,He2022ASA,li2023learning}). Besides, when the server terminates the algorithm, some agents may hold data that have not been uploaded to the server. We want the amount of such data to be small compared with the sample complexity $\tau$, since it does not contribute to identifying $\hat k^*$. Our event-triggered communication protocol efficiently limits the number of these unused samples.
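
\paragraph{Illustrative example of the communication event }
As a sketch with hypothetical values, note first that the event in line 11 of Algorithm \ref{alg3} is equivalent to
\begin{align}
\nonumber
    \sum_{k=1}^K T^{loc}_{m_t,t}(k) > \gamma\sum_{k=1}^K T_{m_t,t}(k).
\end{align}
Suppose $M = 2$ and $K = 5$, so that the choice $\gamma = 1/(2MK)$ used in Theorem \ref{theorem1} gives $\gamma = 1/20$. If agent $m_t$ currently holds $\sum_{k=1}^K T_{m_t,t}(k) = 100$, the threshold is $\gamma\cdot 100 = 5$; since the local count grows by one per pull, the agent communicates on its sixth pull after its last communication. Right after initialization, $\sum_{k=1}^K T_{m,K+1}(k) = K = 5$ and the threshold is $1/4$, so an agent's first pull already triggers a communication.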
% \end{remark}

\begin{algorithm*}[t]
 \centering
\renewcommand{\algorithmicrequire}{\textbf{Input:}}
\renewcommand{\algorithmicensure}{\textbf{Output:}}
	\caption{Federated Asynchronous MAB Pure Exploration (\texttt{FAMABPE}) }
    \label{alg3}
	\begin{algorithmic}[1]
            \STATE \textbf{Inputs:} Arm set $\A$, agent set $\M$, trigger parameter $\gamma$, and $(\delta,\epsilon)$
            \STATE \textbf{Initialization:}
            \STATE  In rounds $t = 1,\dots,K$, arm $t$ is pulled and reward $r_{t}$ is received, $\forall t\in[K]$ 
            \STATE Server sets $\hat{\mu}_{ser,K}(k) = r_{k}$ and $T_{ser,K}(k) = 1$, $\forall k\in\A$ \COMMENT{{\color{blue}Server initialization}}
            \FOR{$m=1:M$}
            \STATE Agent $m$  sets $\hat{\mu}_{m,K+1}(k) = r_{k}$, $T_{m,K+1}(k) = 1$ and $T_{m,K}^{loc}(k) = S_{m,K}^{loc}(k) = 0$, $\forall k\in\A$  \COMMENT{{\color{blue}Agents initialization}}
            \ENDFOR
            \FOR {$t = K+1:\infty$}
            \STATE Agent $m_t$ sets $i_{m_t,t}$ and $j_{m_t,t}$ based on (\ref{alg1eq1}), pulls arm $k_{m_t,t}$ and receives reward $r_{m_t,t}$ \COMMENT{{\color{blue}Sampling rule}}
            \STATE Agent $m_t$ sets $S_{m_t,t}^{loc}(k_{m_t,t}) = S_{m_t,t-1}^{loc}(k_{m_t,t}) + r_{m_t,t}$ and $T_{m_t,t}^{loc}(k_{m_t,t}) = T_{m_t,t-1}^{loc}(k_{m_t,t})+1$
           \IF {$\sum_{k=1}^K(T_{m_t,t}(k) + T_{m_t,t}^{loc}(k)) > (1+\gamma)\sum_{k=1}^KT_{m_t,t}(k)$} 
            \STATE \textbf{[Agent $m_t$ $\rightarrow$ Server]} Send $S_{m_t,t}^{loc}(k)$ and $T_{m_t,t}^{loc}(k)$, $\forall k\in\A$ to the server \COMMENT{{\color{blue}Upload data to server}}
            \STATE Server updates $\hat{\mu}_{ser,t}(k)$, $T_{ser,t}(k)$, $\forall k\in \A$, $i_{ser,t}$,  $j_{ser,t}$
            and  $B(t)$ based on (\ref{alg1eq2}) and (\ref{alg1eq3})
            \IF {$B(t) \le \epsilon$} 
            \STATE Server returns $i_{ser,t}$ as the estimated best arm $\hat k^*$ and the algorithm terminates \COMMENT{{\color{blue}Stopping rule and decision rule}}
            \ENDIF
            \STATE \textbf{[Server $\rightarrow$ Agent $m_t$]} Send $T_{ser,t}(k)$ and $\hat{\mu}_{ser,t}(k)$, $\forall k\in\A$ to agent $m_t$ \COMMENT{{\color{blue}Download data from server}}
            \STATE Agent $m_t$ sets $T_{m_t,t+1}(k) = T_{ser,t}(k)$ and $ \hat{\mu}_{m_t,t+1}(k) = \hat{\mu}_{ser,t}(k)$, $\forall k\in\A$
            \STATE Agent $m_t$ sets $T_{m_t,t}^{loc}(k) = 0$ and $S_{m_t,t}^{loc}(k) = 0$, $\forall k\in \A$
            \ELSE
            \STATE Agent $m_t$ sets $T_{m_t,t+1}(k) = T_{m_t,t}(k)$ and $ \hat{\mu}_{m_t,t+1}(k) = \hat{\mu}_{m_t,t}(k)$, $\forall k\in\A$
            \ENDIF
            \STATE Inactive agent $m\not = m_t$ sets $T_{m,t+1}(k) = T_{m,t}(k)$ and $ \hat{\mu}_{m,t+1}(k) = \hat{\mu}_{m,t}(k)$, $\forall k\in\A$
            \ENDFOR
	\end{algorithmic}  
\end{algorithm*}

We show that the proposed \texttt{FAMABPE} algorithm attains a near-optimal sample complexity $\tau$ with a low communication cost $\C(\tau)$, as stated in the following theorem.
\begin{theorem} \label{theorem1} With $\gamma = 1/(2MK)$ and exploration bonuses
    \begin{align}\label{7}
    \begin{split}
    &\alpha^M_{m_t,t}(k)=\\& \sigma\sqrt{\frac{2}{T_{m_t,t}(k)}\log\bigg(\frac{4K}{\delta}\Big((1+\gamma M)\sum_{k'=1}^K T_{m_t,t}(k')\Big)^2\bigg)},\\ &\alpha^M_{ser,t}(k) =\\& \sigma\sqrt{\frac{2}{T_{ser,t}(k)}\log\bigg(\frac{4K}{\delta}\Big((1+\gamma M)\sum_{k'=1}^K T_{ser,t}(k')\Big)^2\bigg)},
    \end{split}
    \end{align}
the estimated best arm $\hat k^*$ of \texttt{FAMABPE} satisfies condition (\ref{1}), and with probability at least $1-\delta$ the sample complexity is bounded by
\begin{align}
\begin{split}
\nonumber
    \tau \le& \frac{M + 1/(2K)}{M - 1/2}H^M_{\epsilon}2\log\bigg(\frac{4K}{\delta}\Big(\Big(1+1/(2K)\Big)\Lambda\Big)^{2}\bigg),
\end{split}
\end{align}
where 
\begin{align}
\begin{split}
\nonumber
        H^M_\epsilon &= \sum_{k=1}^K \frac{\sigma^2}{\max\big(\frac{\Delta(k^*,k) + \epsilon}{3},\epsilon\big)^2}\\ & = O\bigg(\sum_{k=1}^K  \frac{1}{(\Delta(k^*,k) + \epsilon)^2}\bigg)
\end{split}
\end{align}
is the problem complexity in the MAB \citep{Gabillon2012BestAI} and
 \begin{align}
 \nonumber
     \Lambda = \bigg(\frac{M + 1/(2K)}{M - 1/2}H^M_{\epsilon}4\bigg)^2\frac{4K(1+1/(2K))}{\delta^{1/2}}.
 \end{align}
The communication cost satisfies $\C(\tau) = \tilde{O}( KM )$.
\end{theorem}
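
\paragraph{Illustrative reading of the bound }
As a sketch, we spell out two elementary steps behind the statement, assuming $\epsilon > 0$ and $\Delta(k^*,k)\ge 0$ for all $k\in\A$. First, since $\max(a,b)\ge a$, each term of $H^M_\epsilon$ satisfies
\begin{align}
\nonumber
    \frac{\sigma^2}{\max\big(\frac{\Delta(k^*,k) + \epsilon}{3},\epsilon\big)^2} \le \frac{9\sigma^2}{(\Delta(k^*,k) + \epsilon)^2},
\end{align}
so that $H^M_\epsilon \le 9\sigma^2\sum_{k=1}^K (\Delta(k^*,k) + \epsilon)^{-2}$, which gives the $O(\cdot)$ expression when $\sigma$ is treated as a constant. Second, for the hypothetical values $M = 2$ and $K = 5$, the leading factor of the sample complexity bound evaluates to
\begin{align}
\nonumber
    \frac{M + 1/(2K)}{M - 1/2} = \frac{2.1}{1.5} = 1.4,
\end{align}
and for fixed $K$ this factor tends to $1$ as $M$ grows.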

\paragraph{Proof sketch of Theorem \ref{theorem1} } 
The proof of Theorem \ref{theorem1} consists of three main components: a) bounding the communication cost $\C(\tau)$; b) bounding the sample complexity $\tau$; and c) showing that the estimated best arm satisfies Eq.~\eqref{1}. 
Specifically, to upper bound the total communication cost $\mathcal{C}(\tau)$, we use the property of the event trigger that controls when the agents communicate with the server (Lemma \ref{lemmacommunication1} in the Appendix).
To upper bound the sample complexity $\tau$, we first establish the relation between $\sum_{k=1}^KT_{ser,t}(k)$ and $\sum_{k=1}^KT^{loc}_{m,t}(k)$ based on the event-triggered strategy (Lemma \ref{lemmarela1} in the Appendix). Then, for all $k\in\A$ and $m\in\M$, we construct the exploration bonuses from $\sum_{k=1}^K T_{ser,t}(k)$ and $\sum_{k=1}^K T_{m,t}(k)$, and bound $T_{ser,\tau}(k)$ accordingly (Lemmas \ref{lemmaprobabilitybound}, \ref{lemmabound1} and \ref{serverlemma1} in the Appendix). Finally, using the relation between $T_{ser,t}(k)$ and $T^{all}_{t}(k) = T_{ser,t}(k) + \sum_{m=1}^M T^{loc}_{m,t}(k) = \sum_{s=1}^t \bone\{k_{m_s,s} = k\}$, we bound $T^{all}_{\tau}(k)$, $\forall k\in\A$, and hence $\tau = \sum_{k=1}^K T^{all}_\tau(k)$. The guarantee of finding the best arm, i.e., Eq.~\eqref{1}, follows directly from the property of the breaking index: if $B(\tau) \le \epsilon$, then $\Delta(k^*,\hat{k}^*) \le \epsilon$ with probability at least $1-\delta$. 

\begin{remark}
    The sample complexity of \texttt{FAMABPE} (i.e., $\tau = O(H^M_\epsilon \log(H^{M}_\epsilon/\delta))$) matches the lower bound of the $(\epsilon,\delta)$ pure exploration problem, $\Omega(\sum_{k=1}^K \log(1/\delta)/(\Delta(k^*,k) + \epsilon)^2)$ (see Lemma 1 of \cite{Kaufmann2014OnTC} for details), up to a constant factor. This implies that if we run an $(\epsilon,\delta)$ pure exploration algorithm on each of the $M$ agents independently with no communication,
    the sample complexity is $O(M\sum_{k=1}^K \log(1/\delta)/(\Delta(k^*,k) + \epsilon)^2)$, so \texttt{FAMABPE} accelerates the learning process by a factor of $O(M)$. 
    In terms of communication cost, \texttt{FAMABPE}'s linear dependence on $M$ 
    matches that attained by previous works studying distributed/federated pure exploration under the less challenging synchronous communication environment \citep{Hillel2013DistributedEI,Karpov2020CollaborativeTD,Reda2022NearOptimalCL,Reddy2022AlmostCC}. Moreover, the factor $K$ is due to the communication event, which ensures that $\sum_{k=1}^K T_{m,t}(k)$ and $\sum_{k=1}^K T^{loc}_{m,t}(k)$ are in a desired proportion to $t$. As mentioned in our earlier discussion of its design, this is necessary for the asynchronous communication studied in this paper.
\end{remark}
