\documentclass[12pt]{article}
\usepackage{changes}
\usepackage{cite}
\usepackage{amsmath,amsfonts}
\usepackage{algorithmic}
\usepackage{graphicx}
\usepackage{textcomp}
\usepackage{xcolor}
\usepackage{indentfirst}
\usepackage{CJK}
\usepackage{epstopdf}
\usepackage{flushend}
\usepackage{balance}
\usepackage{lettrine}
\usepackage{cuted}
\usepackage{blindtext}
\usepackage[caption=false,font=normalsize,labelfont=sf,textfont=sf]{subfig}
\usepackage{algorithm}
\usepackage{array}
\usepackage{booktabs}

\allowdisplaybreaks[4]
\def\BibTeX{{\rm B\kern-.05em{\sc i\kern-.025em b}\kern-.08em
T\kern-.1667em\lower.7ex\hbox{E}\kern-.125emX}}

\begin{document}

\textbf{{\color{blue}Reply to Reviewer: WD3P}}
\\
\\
We thank the reviewer for acknowledging the novelty of our problem and the presentation of the theoretical results.
% We thank you for your kind and helpful questions. The answers are provided as follows.
We respond to the reviewer's questions in the following.
\\
\\
**Q1** *In Figure 1 (b), we cannot really see the advantage of the proposed method for linear bandits, because different from Figure 1 (a), the communication cost of the proposed method is also larger than the method with synchronous communication in most of the cases.*
\\
\\
**A1** *Thank you for your insightful comment. 
% The reason that the synchronous baseline outperform the \texttt{FALinPE} can be attributed to 
We want to clarify that the asynchronous setting is intrinsically more difficult than its synchronous counterpart, as also acknowledged in prior works on regret minimization [He et al., 2022; Li et al., 2021, 2023]; i.e., asynchronous algorithms typically incur a larger communication cost than synchronous ones under the same regret guarantee. 
Therefore, the synchronous algorithms are included mainly as a reference showing the performance attainable under the easier synchronous setting. 
Specifically, the synchronous environment assumes that every agent participates in exploration in each round and that the server can initiate a global communication round. These assumptions ensure that the server receives global information in each synchronization round, enabling it to effectively exploit all data the agents have and achieve a lower sample complexity. However, in the asynchronous environment, such a global communication round is not feasible and there is no guarantee on when or whether an agent will become active. Hence, the server cannot obtain the global information unless the agents communicate with it in each round (which results in a communication cost of $2\tau$). 
% We also need to mention that in [He et. al., 2022; Li et al., 2021, 2023], their asynchronous regret minimization algorithms are also outperformed by the synchronous baselines.

Besides, as highlighted in the introduction, the efficacy of existing federated synchronous pure exploration algorithms hinges on the strong assumptions of synchronous communication. When these assumptions do not hold, e.g., due to the existence of stragglers in real-world systems, synchronous communication becomes ineffective because it must wait for the slowest agent to respond. The main advantage of our \texttt{FALinPE} is its ability to achieve near-optimal performance in an asynchronous environment, a feat not achievable by any synchronous algorithm.

We will add this discussion to the revised version of our paper.*
\\
\\
**Q2** *If I understand correctly, for both multi-armed bandits and linear bandits, the algorithm results from incorporating pure exploration algorithms for standard bandits into federated bandits. So I think it would be good to clearly discuss what are the main resulting technical challenges in terms of both the design of the algorithms and the analysis.*
\\
\\
**A2** *Thanks for your comment. The main challenge addressed in our paper is to design federated pure exploration algorithms that can work in the asynchronous environment. We here highlight our contributions and the technical challenges. 

1. A key challenge in conducting pure exploration via asynchronous communication is the absence of dedicated synchronous communication rounds where the server can assign arms to be explored by each agent based on their latest observations. Moreover, there is no guarantee on when or whether an agent would become active again to execute the exploration and report its observations back. This severely hinders the applicability of all existing distributed/federated pure exploration algorithms, whose exploration strategies are based on solving optimal experimental design. To address this challenge, we adopt a fully adaptive exploration strategy, such that each agent separately and asynchronously decides which arm to pull, based on the statistics received from the server in its latest communication.

2. As discussed in the \textbf{design of the communication event} in Section 4, a technical challenge we address is that neither the agents nor the server has access to the global observation number (i.e., the time index $t$). Consequently, we cannot directly use the time index $t$ to construct exploration bonuses in the asynchronous environment. Previous works [Li et al., 2021, 2023; He et al., 2022] encountered similar constraints, but they assume that the time horizon $T$ of the regret minimization problem is known and use $T$ to construct the exploration bonus. However, the time horizon $\tau$ of the fixed-confidence pure exploration problem is not known in advance. Hence, we design an upper bound for $t$ by leveraging the triggered event (i.e., event $2$ in the hybrid triggered strategy) to devise the exploration bonus. In Lemma $4$ and the proof of Lemma $2$, we show that $t = \sum_{k=1}^K T_{ser,t}(k) + \sum_{m=1}^M \sum_{k=1}^K T^{loc}_{m,t}(k) \le (1+\gamma M)\sum_{k=1}^K T_{ser,t}(k)$. Subsequently, we substitute $(1+\gamma M)\sum_{k=1}^K T_{ser,t}(k)$ and $(1+\gamma M)\sum_{k=1}^K T_{m_t,t}(k)$ into the exploration bonuses of the server and agent $m_t$, respectively, in place of $t$.

3. The exploration bonuses in the linear bandit depend not only on $t$ but also on the covariance matrices. Therefore, unlike the communication protocol in the MAB, the event-triggered communication protocol in the linear bandit is additionally required to keep $\mathbf{V}_{m_t,t}$ and $\mathbf{V}_{ser,t}$ in a desired proportion to the global covariance matrix $\lambda\mathbf{I} + \sum_{s=1}^t \mathbf{x}_{m_s,s}\mathbf{x}_{m_s,s}^\top$.
Based on this requirement, we propose a hybrid event-triggered strategy that simultaneously controls the size of $\mathbf{V}^{loc}_{m,t}$, $\sum_{k=1}^KT^{loc}_{m,t}(k)$, and the exploration bonuses. Our proof shows that the hybrid event-triggered communication protocol also achieves a low communication cost compared with asynchronous regret minimization algorithms for linear bandits [He et al., 2022; Li et al., 2021, 2023].
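
For concreteness, the counting bound in item 2 can be sketched as follows. This is only an illustration under the assumption (stated here for the sketch, not quoted from the paper) that event $2$ keeps every agent's un-uploaded observation count at most $\gamma$ times the server's count, i.e., $\sum_{k=1}^K T^{loc}_{m,t}(k) \le \gamma \sum_{k=1}^K T_{ser,t}(k)$ for all $m$; the precise statement is the one in Lemma $4$ and the proof of Lemma $2$.
\[
t \;=\; \sum_{k=1}^K T_{ser,t}(k) + \sum_{m=1}^M \sum_{k=1}^K T^{loc}_{m,t}(k)
\;\le\; \sum_{k=1}^K T_{ser,t}(k) + M\gamma \sum_{k=1}^K T_{ser,t}(k)
\;=\; (1+\gamma M)\sum_{k=1}^K T_{ser,t}(k).
\]
The right-hand side involves only quantities known to the server, which is why it can stand in for the unknown $t$ in the exploration bonus.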

We will include the discussion in the revised version of our paper.*
\\
\\
% Once again, we thank the reviewer for the kind and helpful questions, they really guide us to improve the quality of the paper.
\\
\\
**Reference**

[1] He, J., Wang, T., Min, Y., $\And$ Gu, Q. (2022). A Simple and Provably Efficient Algorithm for Asynchronous Federated Contextual Linear Bandits. ArXiv, abs/2207.03106.

[2] Li, C., $\And$ Wang, H. (2021). Asynchronous Upper Confidence Bound Algorithms for Federated Linear Bandits. ArXiv, abs/2110.01463.

[3] Li, C., Wang, H., Wang, M., $\And$ Wang, H. (2023). Learning Kernelized Contextual Bandits in a Distributed and Asynchronous Environment. International Conference on Learning Representations.
\\
\\
\textbf{{\color{blue}Reply to Reviewer: DWwV}}
\\
\\
% Thanks for the helpful and kind suggestions. We provide the answers as follows.
We thank the reviewer for appreciating our writing and presentation of the technical results. In the following we respond to the reviewer’s questions.
\\
\\
**Q1** *The main weakness of this paper is perhaps the connection of this work to the general field or audience. It is somewhat hard to understand the level of contribution of their algorithms to the community as a non-expert in federated learning. Adding some discussion about e.g. the advantages of federated learning over standard BAI in bandits would be good.*
\\
\\
**A1** *Thank you for your valuable suggestion. 
% It greatly aids in enhancing the clarity of our paper. 
Here, we outline the advantages of federated asynchronous pure exploration, which we will incorporate into the revised version of our paper.

1. Federated pure exploration algorithms can accelerate the learning process. When employing single-agent pure exploration algorithms (such as UGapEc or LinGapE) independently on $M$ agents without communication, the sample complexity becomes $O(M)$ times larger than that of our algorithms. This suggests that FAMABPE can expedite the learning process by a factor of $O(M)$.

2. Federated pure exploration addresses challenges beyond the capabilities of single-agent pure exploration. Consider a scenario where individual agents lack sufficient samples (e.g., funding or resources) to accomplish fixed-confidence pure exploration tasks independently. In the federated pure exploration setting, by involving an adequate number of agents and utilizing federated pure exploration algorithms, we can effectively tackle such problems.

3. Our asynchronous algorithms are more practical than their synchronous counterparts. Existing federated pure exploration algorithms are typically confined to synchronous settings [Hillel et al., 2013; R\'eda et al., 2022; Du et al., 2021], in which all agents are forced to upload their local data to the server upon request and then download the latest data from the server after all uploads are completed. This requires full agent participation and global synchronization mandated by the server, making it impractical for many real-world application scenarios. In contrast, our asynchronous federated pure exploration algorithms relax these constraints:

1) Each agent can decide whether to participate in each round. Full participation isn't obligatory, thus accommodating temporarily offline agents; and 2) communication between each agent and the server is asynchronous and completely independent of other agents. There's no need for global synchronization or mandatory coordination by the server.*
\\
\\
**Q2** *In experiments, it seems that the authors only compare their algorithms to single-agent and synchronous algorithms and it is no surprise that the proposed algorithm performs sub-optimally against them. Is there any asynchronous benchmark that the authors could compare to, say, e.g. asynchronous algorithms for regret minimization?*
\\
\\
**A2** *Thank you for your question. To the best of our knowledge, our proposed algorithms are the first to address pure exploration in asynchronous federated bandits. Additionally, existing works on regret minimization in asynchronous federated bandits [He et al., 2022; Li et al., 2021, 2023] primarily concentrate on minimizing cumulative regret over $T$ iterations. In contrast, our algorithms output $i_{ser,\tau}$ as the estimated best arm, while their algorithms lack a decision rule for outputting an estimated best arm. Therefore, we cannot directly compare our results to theirs.*
\\
\\
**Reference**

[1] Hillel, E., Karnin, Z.S., Koren, T., Lempel, R., $\And$ Somekh, O. (2013). Distributed Exploration in Multi-Armed Bandits. ArXiv, abs/1311.0800.

[2] R\'eda, C., Vakili, S., $\And$ Kaufmann, E. (2022). Near-Optimal Collaborative Learning in Bandits. ArXiv, abs/2206.00121.

[3] Du, Y., Chen, W., Kuroki, Y., $\And$ Huang, L. (2021). Collaborative Pure Exploration in Kernel Bandit. ArXiv, abs/2110.15771.

[4] He, J., Wang, T., Min, Y., $\And$ Gu, Q. (2022). A Simple and Provably Efficient Algorithm for Asynchronous Federated Contextual Linear Bandits. ArXiv, abs/2207.03106.

[5] Li, C., $\And$ Wang, H. (2021). Asynchronous Upper Confidence Bound Algorithms for Federated Linear Bandits. ArXiv, abs/2110.01463.

[6] Li, C., Wang, H., Wang, M., $\And$ Wang, H. (2023). Learning Kernelized Contextual Bandits in a Distributed and Asynchronous Environment. International Conference on Learning Representations.
\\
\\
\textbf{{\color{blue}Reply to Reviewer: 9j35}}
\\
\\
We thank you for your helpful and insightful questions. The answers are provided as follows.
\\
\\
**Q1** *I am not quite clear about the learning objective in the context of asynchronous federated bandits. Looking at previous research, such as by He et al. [2022] and Li et al. [2023], the primary learning objective appears to still be regret. I think that in this respect, the authors need to provide some concrete examples to illustrate its practical significance.*
\\
\\
**A1** *
% Thank you for the question. 
% it really help us to improve the clarity of the paper. 
The learning objective of the federated fixed confidence pure exploration problem studied in this paper is to identify an $\epsilon$-optimal best arm with high probability, with communication cost and sample complexity being as low as possible.

Here is a practical example. Consider a sequential experimental design problem, e.g., for drug discovery or chemical synthesis, where our goal is to identify an arm that is $\epsilon$-optimal (i.e., a chemical with the desired properties) with high probability. In this problem, we are not concerned with cumulative regret (i.e., the quality of the chemicals tried during the online learning process); instead, we only care about whether we can find the optimal arm **in the end**, and about the corresponding sample complexity and communication cost, since both samples and communication are expensive (see the introductions of [Hillel et al., 2013; R\'eda et al., 2022; Du et al., 2021] for details). Additionally, each laboratory lacks the samples (e.g., funding or resources) to complete the task individually, so multiple labs need to collaborate on the learning task. These requirements motivate the study of federated pure exploration problems. Moreover, previous synchronous federated pure exploration algorithms assume that every agent (i.e., lab) participates in the exploration (i.e., runs an experiment) in each round and that the server can force all agents to upload their data in synchronization rounds. This is impractical because some agents may go offline (e.g., when they run out of resources), and all other agents must then wait until they come back online (e.g., after collecting enough resources), which significantly reduces the learning speed (see the introductions of [He et al., 2022; Li et al., 2021, 2023] for details). Our asynchronous federated pure exploration algorithms relax these assumptions: 1) Each agent can decide whether to participate in each round. Full participation isn't obligatory, thus accommodating temporarily offline agents; and 2) communication between each agent and the server is asynchronous and completely independent of other agents. 
Based on this discussion, we believe our asynchronous federated pure exploration algorithms are more practical than previous synchronous federated pure exploration algorithms.

We will add this example in the future version of our paper.*
\\
\\
**Q2** *I have a question regarding the problem setting: does the actual objective here consist of minimizing the sample complexity as well as minimizing the communication cost? Also, does the term "asynchronous" imply that in each round, after an agent pulls an arm and receives feedback, it can choose whether to upload the data or not? Does this mean that some data, if deemed not very significant, can be chosen not to be uploaded and used?*
\\
\\
**A2** *Thank you for the insightful questions.

1. Yes. We answer this question in **A1**.

2. Yes. In our setting, the server cannot compel agents to participate in communication rounds. Active agents have the discretion to decide whether to communicate with the server or not. In our algorithms, we enable the active agent to communicate with the server only when a communication event is triggered; otherwise, the active agent refrains from uploading its data to the server. Besides, in the \textbf{design of communication event} of Section 4, we also mention that some agents may possess data that has not been uploaded to the server when the algorithm is terminated.

We will enhance our paper's presentation based on this discussion.*
\\
\\
Once again, we thank you for these helpful questions. They really help us improve the clarity of the paper.
\\
\\
**Reference**

[1] Li, C., $\And$ Wang, H. (2021). Asynchronous Upper Confidence Bound Algorithms for Federated Linear Bandits. ArXiv, abs/2110.01463.

[2] He, J., Wang, T., Min, Y., $\And$ Gu, Q. (2022). A Simple and Provably Efficient Algorithm for Asynchronous Federated Contextual Linear Bandits. ArXiv, abs/2207.03106.

[3] Xu, L., Honda, J., $\And$ Sugiyama, M. (2017). Fully adaptive algorithm for pure exploration in linear bandits. arXiv: Machine Learning.
\\
\\
\textbf{{\color{blue}Reply to Reviewer: DTya}}
\\
\\
We thank you for your kind and insightful comments and suggestions. The answers are provided as follows.
\\
\\
**Q1** *Why the communication costs (in both cases) do not dependent on the confidence level $\delta$? For example, in eq (18), it is shown that $C(\tau) \le 2(M + 1/\gamma)\log(\tau)$. However, since $\tau$ depends on $\delta$ (with order $\log(1/\delta)$), the communication cost should also depend on $\delta$ (with order $\log\log(1/\delta)$).*
\\
\\
**A1** *Thank you for your insightful question. In Theorem $1$, we bound the sample complexity of \texttt{FAMABPE} as $\tau = O(H_\epsilon^M \log(H_\epsilon^M/\delta))$ and the communication cost as $C(\tau) \le 2(M + 1/\gamma)\log(\tau)$. By substituting the first bound into the second, we obtain $C(\tau) = O((M + 1/\gamma)\log(H_\epsilon^M \log(H_\epsilon^M/\delta)))$, which depends on $\delta$ with order $\log\log(1/\delta)$. The same reasoning applies to Theorem $2$. Our current version omits the $\log\log(\cdot)$ term for simplicity of expression. We will add the $\delta$-dependent results to the future version of our paper.*
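
The substitution can be spelled out as follows. This is a sketch that suppresses constants and assumes $H_\epsilon^M/\delta > 1$ so that all logarithms are positive:
\[
\log\!\bigl(H_\epsilon^M \log(H_\epsilon^M/\delta)\bigr)
\;=\; \log H_\epsilon^M + \log\!\bigl(\log H_\epsilon^M + \log(1/\delta)\bigr).
\]
For a fixed problem complexity $H_\epsilon^M$, the only $\delta$-dependent term is the second one, which grows as $O(\log\log(1/\delta))$ when $\delta \to 0$.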
\\
\\
 **Q2** *In \texttt{FALinPE}, why we require a hybrid event-triggered strategy, specifically, why we want to upload/download if the local observation number is higher than $\gamma_2$ times the server observation number? I guess it is used to control the term $(1 + M\gamma_2)$ in the confidence radius. However, since this term is in $\log$, could we do this kind of upload/download in a much less frequency? Maybe this can decrease the communication cost?*
\\
\\
**A2** *Thank you for your helpful question. We here offer a detailed explanation of why our algorithm needs a hybrid event-triggered communication protocol in the linear case.

There are two reasons that we need to design the communication protocol with event $2$.

The first is that neither the agents nor the server has access to the global observation number (i.e., the time index $t$). Consequently, we cannot directly use the time index $t$ to construct exploration bonuses in the asynchronous environment. Previous works [Li et al., 2021, 2023; He et al., 2022] encountered similar constraints, but they were able to use the time horizon $T$ to construct the exploration bonus. However, the time horizon $\tau$ of the fixed-confidence pure exploration problem is not known in advance. Hence, we design an upper bound for $t$ by leveraging the triggered event (i.e., event $2$ in the hybrid triggered strategy) to devise the exploration bonus. In Lemma $4$ and the proof of Lemma $2$, we show that $t = \sum_{k=1}^K T_{ser,t}(k) + \sum_{m=1}^M \sum_{k=1}^K T^{loc}_{m,t}(k) \le (1+\gamma M)\sum_{k=1}^K T_{ser,t}(k)$. Subsequently, we substitute $(1+\gamma M)\sum_{k=1}^K T_{ser,t}(k)$ and $(1+\gamma M)\sum_{k=1}^K T_{m_t,t}(k)$ into the exploration bonuses of the server and agent $m_t$, respectively, in place of $t$. 

The second reason is that when the server terminates the algorithm, some agents may possess data that has not been uploaded to the server. We want the amount of such data to be small compared with the sample complexity $\tau$, since it does not contribute to identifying the estimated best arm. Event $2$ effectively limits the number of these unused samples.

Besides, unlike the exploration bonus in the MAB, the exploration bonus in the linear bandit additionally depends on the covariance matrix, and we need to keep $\mathbf{V}_{m_t,t}$ and $\mathbf{V}_{ser,t}$ in a desired proportion to the global covariance matrix $\lambda\mathbf{I} + \sum_{s=1}^t \mathbf{x}_{m_s,s}\mathbf{x}_{m_s,s}^\top$.
Based on this requirement, the first event is designed to control the size of $\mathbf{V}^{loc}_{m,t}$ and $\mathbf{V}_{ser,t}$; the result is shown in Lemma $9$.

The above detailed discussion will be added to the future version of our paper.*
\\
\\
**Q3** *In your experiments (a), why the communication cost of FAMABPE almost remains the same under different sample complexity $\tau$?*
\\
\\
**A3** *Thank you for your question. We believe the reason is that the communication cost depends only logarithmically on the sample complexity $\tau$ (i.e., $C(\tau) = O((M + 1/\gamma)\log(\tau))$). Here is the raw data of the communication cost of \texttt{FAMABPE} in Fig 1 (a):

Here is the data.

From the data above, it is evident that as the sample complexity decreases gradually, the corresponding communication cost also decreases gradually. We will annotate each data point in Fig 1 (a) in the new version of the paper.*
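
As a rough illustration of how weak this dependence is, consider the bound $C(\tau) \le 2(M + 1/\gamma)\log(\tau)$ from Eq. (18) with hypothetical values $M = 10$ and $\gamma = 1$ (chosen only for this example, not taken from the experimental settings) and the natural logarithm (also an assumption here):
\[
\tau = 10^4:\quad 2(10+1)\log(10^4) \approx 22 \times 9.21 \approx 203,
\qquad
\tau = 10^5:\quad 2(10+1)\log(10^5) \approx 22 \times 11.51 \approx 253.
\]
A tenfold increase in $\tau$ thus raises the bound by a factor of only $\log(10^5)/\log(10^4) = 5/4$.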
\\
\\
Once again, thank you very much for these insightful and helpful suggestions and questions. They really help us improve the quality of the paper!
\\
\\
**Reference**

[1] He, J., Wang, T., Min, Y., $\And$ Gu, Q. (2022). A Simple and Provably Efficient Algorithm for Asynchronous Federated Contextual Linear Bandits. ArXiv, abs/2207.03106.

[2] Li, C., $\And$ Wang, H. (2021). Asynchronous Upper Confidence Bound Algorithms for Federated Linear Bandits. ArXiv, abs/2110.01463.

[3] Li, C., Wang, H., Wang, M., $\And$ Wang, H. (2023). Learning Kernelized Contextual Bandits in a Distributed and Asynchronous Environment. International Conference on Learning Representations.
\\
\\
\textbf{{\color{blue}Reply to Reviewer: Mvsf}}
\\
\\
We thank you for your insightful comments and suggestions. The answers are provided as follows.
\\
\\
**Q1** *Your empirical performance is not impressive, and how this proposal is useful in practice or industry is unclear, which should be clearly articulated.*
\\
\\
**A1** *Thank you for the suggestion. It really helps us improve the presentation of our paper. Here we provide a practical example for our algorithms. 

Consider a sequential experiment design problem where our goal is to identify an experiment that is $\epsilon$-optimal with high probability. In this problem, we are not concerned with the regret (i.e., the cumulative experiment performance); instead, we focus solely on the sample complexity and communication cost, since both samples and communication are expensive (see the introductions of [Mitra et al., 2021; Du et al., 2021] for details). Additionally, each laboratory lacks the samples (e.g., funding or resources) to complete the task individually, so multiple labs need to collaborate on the learning task. These requirements motivate the study of the federated pure exploration problem. Moreover, previous synchronous federated pure exploration algorithms assume that every agent (i.e., lab) participates in the exploration (i.e., runs an experiment) in each round and that the server can force all agents to upload their data in synchronization rounds. This is impractical because some agents may go offline (e.g., when they run out of resources), and all other agents must then wait until they come back online (e.g., after collecting enough resources), which significantly reduces the learning speed (see the introductions of [He et al., 2022; Li et al., 2021, 2023] for details). Our asynchronous federated pure exploration algorithms relax these assumptions: 1) Each agent can decide whether to participate in each round. Full participation isn't obligatory, thus accommodating temporarily offline agents; and 2) communication between each agent and the server is asynchronous and completely independent of other agents. Based on this discussion, we believe our asynchronous federated pure exploration algorithms are more practical than previous synchronous federated pure exploration algorithms.

We will add this example in the future version of our paper.*
\\
\\
**Q2** *
In your experiments, the data scale is too small and you don't have large-scale and real-world or production data based experimental results to support your claims which is one of the main drawbacks of this work.*
\\
\\
**A2** *Thank you for your comment and suggestion. The numbers of arms (i.e., $K = 5$ and $10$) in our experimental section align with common practice in related papers [Mitra et al., 2021; Du et al., 2021]. A production system such as a recommender system usually contains two or more stages, where the first stage filters out most of the low-reward arms and leaves tens or hundreds of candidates, so that the later stages focus solely on identifying the best arm among the promising ones (which have small reward gaps). 
Additionally, we note that we present experimental results on the real-world MovieLens dataset for the linear case in Appendix A. We intend to move these results into the main paper in the final version, using the additional pages allowed.*
\\
\\
**Q3** *This manuscript studies asynchronous federated pure exploration algorithms, with the potential to incorporate decentralised bandits, there are related state-of-the-art you should compare: Fast Distributed Bandits for Online Recommendation Systems, Distributed Online and Bandit Convex Optimization*
\\
\\
**A3** *Thank you for suggesting the relevant papers. We will add them to our references.

The paper "Fast Distributed Bandits for Online Recommendation Systems" introduces a novel distributed bandit-based algorithm \texttt{DistCLUB}. This algorithm lazily forms clusters in a distributed manner, substantially reducing the need for network data sharing and achieving high scalability. 
Besides, the paper "Distributed Online and Bandit Convex Optimization" aims to minimize regret on $M$ machines working in parallel over $T$ rounds under an intermittent communication budget $R$ and bandit feedback. However, these algorithms only work in the synchronous environment. In our paper, we consider the more general asynchronous environment and propose two algorithms that achieve near-optimal theoretical performance in such an environment.

The above discussion will also be added to the future version of our paper.*

\\
\\
**Reference**

[1] Du, Y., Chen, W., Kuroki, Y., $\And$ Huang, L. (2021). Collaborative Pure Exploration in Kernel Bandit. ArXiv, abs/2110.15771.

[2] Mitra, A., Hassani, H., $\And$ Pappas, G. (2021). Exploiting Heterogeneity in Robust Federated Best-Arm Identification. ArXiv, abs/2109.05700.

[3] He, J., Wang, T., Min, Y., $\And$ Gu, Q. (2022). A Simple and Provably Efficient Algorithm for Asynchronous Federated Contextual Linear Bandits. ArXiv, abs/2207.03106.

[4] Li, C., $\And$ Wang, H. (2021). Asynchronous Upper Confidence Bound Algorithms for Federated Linear Bandits. ArXiv, abs/2110.01463.

[5] Li, C., Wang, H., Wang, M., $\And$ Wang, H. (2023). Learning Kernelized Contextual Bandits in a Distributed and Asynchronous Environment. International Conference on Learning Representations.
\end{document}
