\section{PRELIMINARIES}

% \textbf{Notations. }
In this paper, we let $[t] = \{1,\ldots,t\}$. We use $\Vert \x \Vert$ to denote the Euclidean norm of a vector $\x$, $\Vert \x \Vert_{\V} = \sqrt{\x^\top \V \x}$ to denote the norm of $\x$ weighted by a positive semi-definite matrix $\V\in\R^{d\times d}$, $\log$ to denote the natural logarithm, $\log_2$ to denote the binary logarithm, $\bI\in \R^{d\times d}$ to denote the identity matrix, $\bold{0}$ to denote the $d$-dimensional zero vector or the $d\times d$ zero matrix (depending on the context), $\text{det}(\V)$ to denote the determinant of a matrix $\V\in\R^{d\times d}$, and $\V^\top$ to denote the transpose of $\V$. Besides, we write $x = \Omega(y)$ if there exists a constant $C > 0$ such that $Cy \le x$, and $x = O(y)$ if there exists a constant $C^\prime > 0$ such that $x \le C^\prime y$; $\tilde{O}$ additionally hides poly-logarithmic factors.

\subsection{Federated Bandits}\label{section2.1}

\textbf{MAB }
We consider the federated asynchronous MAB, similar to \cite{Li2021AsynchronousUC,He2022ASA}, as follows. There is a set $\M = \{m\}_{m=1}^M$ of $M$ agents ($M \ge 2$), a central server, and an environment $\A =\{ k \}_{k=1}^K$ with $K$ arms ($K \ge 2$). In each round $t$, an arbitrary agent $m_t \in \M$ becomes active, pulls an arm $k_{m_t,t} \in \A$, and receives a reward $r_{m_t,t}$. The reward of each arm $k\in\A$
follows a $\sigma$-sub-Gaussian distribution with mean $\mu(k)$. As in prior work on pure exploration \citep{Gabillon2012BestAI,Du2021CollaborativePE}, we assume that the best arm $k^* = \arg\max_{k\in\A} \mu(k)$ is unique.

\textbf{Linear bandits }
Unlike the MAB setting, in federated asynchronous linear bandits \citep{Li2021AsynchronousUC,He2022ASA}, every arm $k$ is associated with a context $\x_k \in \R^d$. In round $t$, when the active agent $m_t \in \M$ pulls an arm $k_{m_t,t} \in \A$, it receives the reward $r_{m_t,t} = \x_{m_t,t}^\top \t^* + \eta_{m_t,t}$, where $\x_{m_t,t} = \x_{k_{m_t,t}}$ is the context of the pulled arm, $\t^*\in \R^d$ is the unknown model parameter, and $\eta_{m_t,t} \in \R$ is conditionally $\sigma$-sub-Gaussian noise (see Lemma \ref{auxlemma7} in the appendix for details). Without loss of generality, we assume $\Vert\x_k\Vert \le 1$ for all $k\in\A$ and $\Vert\t^*\Vert \le 1$. We further assume that the best arm $k^* = \arg\max_{k\in\A} \x_k^\top\t^*$ is unique.
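
As a purely illustrative instance of this model (the numbers below are hypothetical and are chosen only so that the norm constraints hold), take $d = 2$, $K = 3$, $\t^* = (0.6, 0.8)^\top$, and contexts $\x_1 = (1, 0)^\top$, $\x_2 = (0, 1)^\top$, $\x_3 = (0.6, 0.8)^\top$, so that $\Vert \t^* \Vert = 1$ and $\Vert \x_k \Vert = 1$ for every $k$. The expected rewards are then
\begin{align*}
    \x_1^\top \t^* = 0.6, \qquad \x_2^\top \t^* = 0.8, \qquad \x_3^\top \t^* = 0.36 + 0.64 = 1,
\end{align*}
and hence $k^* = 3$ is the unique best arm in this toy instance.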

\subsection{Learning Objective}
This paper focuses on the fixed-confidence $(\epsilon,\delta)$-pure exploration problem. The goal of the bandit algorithm is to output an estimated best arm $\hat k^* \in \A$ that satisfies
\begin{align}\label{1}
    \bP(\Delta(k^{*}, \hat k^{*}) \le \epsilon) \ge 1 - \delta
\end{align}
with minimum sample complexity, where the reward gap parameter satisfies $0\le\epsilon<1$ and the probability parameter satisfies $0< \delta < 1$. The expected reward gap between arms $i$ and $j$ is denoted by $\Delta(i,j) = \mu(i) - \mu(j)$ in the MAB and by $\Delta(i,j) = \y(i,j)^\top \t^*$ in linear bandits, where $\y(i,j) = \x_{i} - \x_{j}$ is the difference between the two contexts.
The sample complexity, denoted by $\tau$, is defined as the agents' total number of interactions with the environment.
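
To illustrate the requirement in (\ref{1}), consider a hypothetical MAB instance (the values are ours and serve only as an example) with $K = 3$ arms and means $\mu(1) = 0.9$, $\mu(2) = 0.85$, $\mu(3) = 0.5$, so that $k^* = 1$. The gaps to the best arm are
\begin{align*}
    \Delta(1,1) = 0, \qquad \Delta(1,2) = 0.9 - 0.85 = 0.05, \qquad \Delta(1,3) = 0.9 - 0.5 = 0.4.
\end{align*}
With $\epsilon = 0.1$, any output $\hat k^* \in \{1, 2\}$ satisfies $\Delta(k^*, \hat k^*) \le \epsilon$, whereas $\hat k^* = 3$ does not; with $\epsilon = 0$, only $\hat k^* = k^*$ is acceptable. In this instance, (\ref{1}) therefore requires that the algorithm output arm $3$ with probability at most $\delta$ when $\epsilon = 0.1$.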

\subsection{Communication Model and Asynchronous Environment}
\textbf{Communication model }
In this paper, we consider a star-shaped communication network \citep{Wang2019DistributedBL,Li2021AsynchronousUC,He2022ASA}, in which every agent can communicate only with the server and cannot communicate directly with other agents. We define the communication cost $\C(\tau)$ as the \emph{total number of times} that agents upload data to the server and download data from the server over the $\tau$ rounds \citep{Dubey2020DifferentiallyPrivateFL,Li2021AsynchronousUC,He2022ASA}, i.e.,
\begin{align} \label{2}
\begin{split}
    \C(\tau) = \sum_{t=1}^\tau \Big( &\bone\{ m_t\ \text{uploads data to the server} \} \\ &+ \bone\{m_t\ \text{downloads data from the server}\} \Big).
\end{split}
\end{align}
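
As a small illustration of (\ref{2}), consider a hypothetical schedule (ours, for exposition only) with $\tau = 4$ rounds, where we read the sum in (\ref{2}) as ranging over both indicators. Suppose the active agent both uploads and downloads in round $1$, only uploads in round $3$, and does not communicate in rounds $2$ and $4$. Then
\begin{align*}
    \C(4) = (1 + 1) + (0 + 0) + (1 + 0) + (0 + 0) = 3.
\end{align*}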

\textbf{Asynchronous environment } 
Similar to \cite{He2022ASA,li2023learning}, in the asynchronous environment only one agent $m_t$, which can be an arbitrary agent in $\M$, is active and interacts with the environment in each round $t$. Moreover, except during the initialization steps, only the active agent is allowed to communicate with the server, and it does so independently of the other (offline) agents.

% \begin{remark} [The reasonability of the asynchronous setting]
In our setting, the variable $t$ is the round index: it indicates the order in which agents participate in the bandit problem and does not refer to the actual time of their participation. Even when several agents are involved within a very short time frame, e.g., when exchanging data with the server, their participation events can still be ordered, so agent participation is sequential in the index $t$. Our setting is more general than those of previous studies on pure exploration in federated bandits \citep{Hillel2013DistributedEI, Du2021CollaborativePE, Reddy2022AlmostCC}, because those settings require all agents to participate in every round, whereas ours allows partial participation by any subset of agents.
% \end{remark}

\textbf{Motivating example } We provide a practical example of asynchronous federated pure exploration. Consider a sequential experimental design problem, e.g., in drug discovery or chemical synthesis, where the goal is to identify, with high probability, an arm that is $\epsilon$-optimal (i.e., a chemical with the desired properties). In this problem, we are not concerned with cumulative regret (i.e., the quality of the chemicals tried during the online learning process); instead, we only care about whether we find a (near-)optimal arm in the end, and about the sample complexity and communication cost of doing so, since both are expensive (see the introductions of \cite{Hillel2013DistributedEI,Reda2022NearOptimalCL,Du2021CollaborativePE} for details). Moreover, each laboratory lacks the samples (i.e., the funding for resources) to complete the task on its own, so multiple labs must collaborate on the learning task. These requirements motivate the study of federated pure exploration problems. Furthermore, previous synchronous federated pure exploration algorithms assume that every agent (i.e., lab) participates in the exploration (i.e., runs an experiment) in every round and that the server can force all agents to upload their data in synchronization rounds. This is impractical because some agents may go offline (e.g., when they run out of resources), and all other agents must then wait until they come back online (e.g., after collecting enough resources), which significantly slows down learning (see the introductions of \cite{Li2021AsynchronousUC,He2022ASA,li2023learning} for details). In this paper, we propose two asynchronous federated pure exploration algorithms that do not rely on such synchrony assumptions and are therefore more practical for real-world applications.