\section{Introduction}


\begin{table*}[tp]
\centering
\begin{threeparttable}
    \caption{A comparison summary of prior literature and this work.}
    \label{tab: compare-to-prior-literature}
    \begin{tabular}{|l|ll|}
      \hline
      \textbf{Algorithm}
       & \textbf{Individual Regret}
       & \textbf{Group Regret}
      \\ \hline
      \texttt{Gossip\_UCB}~\citep{zhu2021federated}
       & \(O(\sum_{i:\Delta_i>0} N\Delta_i^{-1}\log T)\)
       & \(O(\sum_{i:\Delta_i>0} N^2\Delta_i^{-1}\log T)\)
      \\
    \texttt{Dis\_UCB}~\citep{zhu2023distributed}
       & \(O(\sum_{i:\Delta_i>0} N^{-1}_{\min}\Delta_i^{-1}\log T)\)
       & \(O(\sum_{i:\Delta_i>0}N N^{-1}_{\min}\Delta_i^{-1}\log T)\)
      \\
     \hline\hline
     \rowcolor[gray]{.9}
      {\texttt{DRRB-bandit} (our work)}
       & \(O(\sum_{i:\Delta_i>0}N^{-1}\Delta_i^{-1}\log T)\)
       & \(O(\sum_{i:\Delta_i>0}\Delta_i^{-1}\log T)\)
      \\
      \rowcolor[gray]{.9}
      {General regret lower bound}
       & \(\Omega(\sum_{i:\Delta_i>0}N^{-2}\Delta_i^{-1}\log T)\)
       & \(\Omega(\sum_{i:\Delta_i>0}N^{-1}\Delta_i^{-1}\log T)\)
      \\
      \rowcolor[gray]{.9}
      {Regret lower bound for special algorithms}
       & \(\Omega(\sum_{i:\Delta_i>0}N^{-1}\Delta_i^{-1}\log T)\)
       & \(\Omega(\sum_{i:\Delta_i>0}\Delta_i^{-1}\log T)\)
      \\
      \hline
    \end{tabular}
\end{threeparttable}
\end{table*}


Online learning problems in federated settings, where a set of agents completes a common learning task by running individual learning algorithms while keeping data local, have been studied extensively because of their many real-world applications. For example, in finance, medicine, and data processing, federated learning is a promising approach to local training and individual privacy problems \citep{yang2019federated,li2020review,liu2022distributed}.
In this paper, we study the federated multi-armed bandit (FMAB) problem, where multiple instances of the multi-armed bandit (MAB) problem are run by a set of agents that communicate with each other.
Recent work has designed distributed bandit algorithms for federated learning problems \citep{feraud2019decentralized,shi2021federated,agarwal2022multi}, where agents can communicate only with their neighbors because practical systems often lack a suitable end-to-end communication protocol.
FMAB with consensus communication has many real-world applications. For instance, it is common for multiple agents to collaborate on large-scale tasks in broadcasting sensor networks, which consist of several wireless sensors that communicate only with their neighbors \citep{li2019optimal,kolla2018collaborative}. 
As a concrete case, selecting an appropriate time to conduct an outdoor experiment requires considering environmental factors such as humidity, temperature, and wind speed. To capture this information, a variety of sensors are deployed, and each sensor is viewed as an agent. At different time steps, these agents provide feedback based on their local observations, which serve as local samples. The ultimate objective is to integrate these local samples to identify the optimal time for the outdoor experiment. In other scenarios, data heterogeneity may arise from privacy protection policies, which require training data to be processed locally.

The major obstacles that prevent FMAB from achieving optimal learning performance are heterogeneous feedback among agents and fully distributed communication. Together, these factors make it difficult for agents to accurately track the global mean.
To learn the global mean effectively, the learning algorithm needs to collect each agent's estimates or observations together with the corresponding numbers of samples. Simply merging the agents' global estimates without knowing the numbers of samples behind them introduces bias.
Furthermore, fully distributed communication means that agents differ in their ability to acquire information, which leads to errors in tracking sample counts.
These errors accumulate over time, ultimately degrading the performance of FMAB algorithms.
As a result, no previous work has achieved optimal regret.
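
To make the bias concrete, consider the following small illustration, which is only a sketch under assumptions not stated above: there are three agents with local means \(\mu_1,\mu_2,\mu_3\) for a fixed arm, the global mean is taken to be their average \(\mu=(\mu_1+\mu_2+\mu_3)/3\), and each agent holds one unbiased local sample. Suppose an estimate \(\hat\mu_a\) already averages the samples of agents 1 and 2 (so it carries \(n_a=2\) samples), while \(\hat\mu_b\) is the single sample of agent 3 (\(n_b=1\)). Merging with the sample counts recovers the target in expectation:
\[
  \mathrm{E}\left[\frac{n_a\hat\mu_a+n_b\hat\mu_b}{n_a+n_b}\right]
  =\frac{2\cdot\frac{\mu_1+\mu_2}{2}+\mu_3}{3}=\mu ,
\]
whereas merging with equal weights, as one might do without access to the counts, gives
\[
  \mathrm{E}\left[\frac{\hat\mu_a+\hat\mu_b}{2}\right]
  =\frac{\mu_1}{4}+\frac{\mu_2}{4}+\frac{\mu_3}{2}
  =\mu+\frac{2\mu_3-\mu_1-\mu_2}{12},
\]
which is biased unless \(\mu_3=(\mu_1+\mu_2)/2\).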

In the presence of heterogeneous feedback, a distributed estimation approach has been proposed. This approach primarily collects information from neighboring agents and estimates the global mean of each arm to determine whether the selection is optimal. However, because of the absence of a central server in a fully distributed communication setup, real-time information remains inaccessible to individual agents.
As explained above, the two obstacles are coupled: fully distributed communication makes heterogeneous feedback more difficult to handle. To address the FMAB problem with only hop-by-hop communication, one can use gossip-based communication \citep{kempe2003gossip,boyd2006randomized}. These methods need neither a central server nor a fully connected communication graph, yet still allow agents to exchange information efficiently.
Building on foundational research in communication methods, several scholars have developed algorithms to tackle the FMAB problem \citep{zhu2021federated, zhu2023distributed, xu2024decentralized}.
These studies employed gossip-based communication and refined the selection strategy using the Upper Confidence Bound (UCB) algorithm \citep{auer2010ucb}. Additionally, they introduced mechanisms to regulate agent behavior, ensuring consensus on sampling frequency across the network.

However, due to inherent limitations of the UCB-based framework, challenges remain in these approaches that can lead to suboptimal results. The core challenge is to obtain unbiased global estimates from biased local observations and limited information from neighbors. Specifically, traditional UCB-based algorithms select the arm with the largest upper confidence bound without accounting for the heterogeneity of agents. This can result in biased estimates of reward means, thereby leading to suboptimal regret performance. While previous works~\citep{zhu2021federated,zhu2023distributed,xu2024decentralized} propose estimation mechanisms to address this issue, the weight assigned to each reward in the global estimate varies. As a consequence, rewards are weighted unfairly, which leads to suboptimal convergence of the estimates.

\paragraph{Related works.} From the perspective of rewards, the federated bandit problem can be divided into two categories: the homogeneous reward setting and the heterogeneous reward setting. In the homogeneous reward setting \citep{hillel2013distributed, wang2019distributed, wang2020optimal}, agents that pull the same arm receive rewards from the same distribution, so their own samples directly help them estimate the global means of the arms. In the heterogeneous reward setting \citep{shi2021federated, zhu2021federated, zhu2023distributed, xu2024decentralized}, each agent has its own local reward distributions, so agents obtain different rewards even if they pull the same arm. The main challenge is to obtain an unbiased global estimate, because local samples alone are not sufficient to learn the global mean.

From the perspective of the communication network, the graph can be either fully connected or fully distributed \citep{shamma2008cooperative}. In a fully connected graph, any two agents are directly connected, while in a fully distributed graph, any two agents are joined by a path that may pass through other agents. For a fully connected graph (end-to-end communication), the time delay is one time slot, and the network can be seen as having a central server because every agent has access to all agents' information. For a fully distributed graph (hop-by-hop communication), the time delay is at most $D$ time slots, where $D$ is the diameter of the communication graph.
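
For intuition about the delay $D$, the diameters of a few standard connected graphs on $N$ agents are listed below. These topologies are chosen only for illustration and are not taken from this work:
\[
  D_{\mathrm{complete}} = 1, \qquad
  D_{\mathrm{ring}} = \lfloor N/2 \rfloor, \qquad
  D_{\mathrm{path}} = N-1 .
\]
Hence, under hop-by-hop communication, information can be delayed by a number of time slots that grows linearly with the system size.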

From the perspective of the sampling method, existing work can be divided into the synchronous setting \citep{wang2019distributed, dubey2020differentially, huang2021federated} and the asynchronous setting \citep{he2022simple, wang2023pure}. In the synchronous setting, each agent samples and communicates at the same frequency, which makes concentration easier than in the asynchronous setting.
In the asynchronous setting, agents cannot implicitly coordinate their actions through time, which makes cooperation difficult. Agents pull arms and exchange information without a common rule, and each agent may act in its own interest.

\paragraph{Contributions.} In this paper, we investigate the federated bandit learning problem described above and make the following contributions.

In Section~\ref{Section: algorithms}, we propose two algorithms called $\texttt{CES}$ and $\texttt{DRRB-bandit}$. (a)~$\texttt{DRRB-bandit}$ leverages round-robin sampling to keep the agents' sampling synchronized \citep{wang2023achieve,perchet2013multi}.
Specifically, the agents communicate with one another to maintain a consistent candidate set and explore the arms within that set in a round-robin manner.
(b)~$\texttt{CES}$ uses an estimation mechanism, new to the literature, to estimate the global mean of each arm. \texttt{CES} combines the global estimates from other agents with its own latest samples in a dynamic proportion, even when agents are not directly connected.
This mechanism fairly allocates the weight of each sample reward in the global estimates, effectively mitigating the effects of heterogeneity and producing a more accurate global estimate.

In Section~\ref{Section: analysis}, we prove that \texttt{DRRB-bandit} achieves a near-optimal individual regret of $O(\sum_{i:\Delta_i>0} N^{-1}\Delta_i^{-1}\log T)$ for each agent, where \(N\) is the number of agents, \(\Delta_i\) is the gap between the optimal global mean and the global mean of arm $i$, and \(T\) is the time horizon. As a direct consequence, the group regret is $O(\sum_{i:\Delta_i>0} \Delta_i^{-1}\log T)$. We also provide two lower bounds on the individual regret: the first, $\Omega(\sum_{i:\Delta_i>0} N^{-2}\Delta_i^{-1}\log T)$, is general and holds for all algorithms; the second, $\Omega(\sum_{i:\Delta_i>0} N^{-1}\Delta_i^{-1}\log T)$, holds for all algorithms with round-robin sampling, implying that \texttt{DRRB-bandit} is near-optimal among all round-robin-based algorithms.
Additionally, the total communication cost is bounded by $O(K\Delta_{\min}^{-1}\log T)$, where $\Delta_{\min}$ is the minimal non-zero gap.
These results improve substantially on those of previous works, the best of which is $O(\sum_{i:\Delta_i>0} N^{-1}_{\min}\Delta_i^{-1}\log T)$ for the individual regret ($O(\sum_{i:\Delta_i>0}N N^{-1}_{\min}\Delta_i^{-1}\log T)$ for the group regret), where $N_{\min}$ denotes the smallest number of neighbors of any agent, including the agent itself.
Table~\ref{tab: compare-to-prior-literature} summarizes these results.
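
The step from individual to group regret can be checked in one line, assuming (as is consistent with every row of Table~\ref{tab: compare-to-prior-literature}) that the group regret is the sum of the $N$ agents' individual regrets:
\[
  N\cdot O\Bigl(\sum_{i:\Delta_i>0} N^{-1}\Delta_i^{-1}\log T\Bigr)
  = O\Bigl(\sum_{i:\Delta_i>0} \Delta_i^{-1}\log T\Bigr).
\]
Under the same assumption, multiplying by $N$ turns the general individual lower bound $\Omega(\sum_{i:\Delta_i>0} N^{-2}\Delta_i^{-1}\log T)$ into the group lower bound $\Omega(\sum_{i:\Delta_i>0} N^{-1}\Delta_i^{-1}\log T)$.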

The improvement in this work is practically significant. In a practical case where $N_{\min}\ll N$, the individual and group regrets in previous works can be as large as $O(\sum_{i:\Delta_i>0} \Delta_i^{-1}\log T)$ and $O(\sum_{i:\Delta_i>0}N \Delta_i^{-1}\log T)$, respectively.
Clearly, the group regret of previous works then grows linearly with the number of agents (or system size), whereas the group regret of our approach does not depend on the number of agents, which makes it better suited to real-world applications of cooperative learning.
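
As a rough numerical illustration, take a ring of $N=100$ agents (a hypothetical topology chosen only for this example), so that every agent has two neighbors and $N_{\min}=3$. Ignoring the constants hidden in the $O(\cdot)$ notation, the leading factors that multiply $\sum_{i:\Delta_i>0}\Delta_i^{-1}\log T$ in the best previous bounds and in ours compare as
\[
  \mathrm{individual:}\quad N_{\min}^{-1}=\frac{1}{3}\ \ \mathrm{versus}\ \ N^{-1}=\frac{1}{100},
\]
\[
  \mathrm{group:}\quad N N_{\min}^{-1}=\frac{100}{3}\approx 33.3\ \ \mathrm{versus}\ \ 1 .
\]
In both cases the ratio is $N/N_{\min}\approx 33.3$. Since the hidden constants may differ between the bounds, this should be read as an indication of scaling rather than as an exact gain.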

The rest of the paper is organized as follows. In Section~\ref{section: problem formulation}, we introduce the necessary notation and the problem formulation. In Section~\ref{Section: algorithms}, we describe the framework of both \texttt{DRRB-bandit} and \texttt{CES}. In Section~\ref{Section: analysis}, we provide theoretical results on the regret of \texttt{DRRB-bandit}, with omitted details deferred to the appendix. In Section~\ref{section: experiment}, we provide experimental results under varying settings.