\section{Algorithm}\label{Section: algorithms}
The first core challenge in the federated bandit problem is estimating the global mean from biased local observations. During the game, each agent maintains its own observations or estimates of its local arms, which deviate from the global mean because of heterogeneous feedback.
Consequently, to learn the global reward mean, agents must aggregate the estimates or observations of all agents, and in the heterogeneous setting each agent is responsible for sampling its own local arms. Insufficient sampling by any single agent therefore results in an imprecise estimate of the global mean. Upon revisiting previous works \citep{zhu2021federated,zhu2023distributed,xu2024decentralized}, we identify a key limitation that leads to suboptimal results: the UCB-based algorithm framework favors the arm with the largest upper confidence bound, so arms are sampled unevenly and the resulting global estimates are biased.
Although these algorithms reduce some bias in decision-making at each round, they still produce biased decisions overall, which leads to uneven learning. As a result, they must rely on the worst case to compute concentration errors, which limits their effectiveness.

To optimize algorithms for federated bandit problems, it is crucial to fully leverage the sample information from each agent.
In a homogeneous bandit setting \citep{hillel2013distributed,shahrampour2017multi,zhu2021distributed}, the samples from one agent can directly benefit the learning process of other agents through communication, as all agents share the same learning objectives. However, in heterogeneous bandit problems, additional challenges arise in algorithm design.
Specifically, in a heterogeneous setting, simply aggregating information from other agents does not ensure that it benefits an agent, as the reward distributions or environments may differ across agents.
To achieve optimal performance, agents need to accurately track both the observations/estimates and the sample counts associated with the observations from all other agents.
Given that the setting is fully distributed, with each agent only able to communicate with its neighbors, it becomes challenging for agents to learn the global mean. 
Therefore, addressing the heterogeneity in federated bandit problems is the second key challenge explored in this paper.

To address the first challenge, we adopt a round-robin-based algorithm framework, where each agent uniformly explores its local arms at each round. We apply this framework to federated bandit problems and introduce the Distributed Round-Robin-Based Bandit Algorithm (\texttt{DRRB-bandit}) in Section~\ref{section: distributed successive elimination}. 
In this framework, each agent maintains a dynamic candidate arm set and samples the arms in the set equally often until an arm is judged suboptimal and eliminated from the set.
Because all agents explore the candidate arms uniformly, each agent implicitly knows the sample counts of all other agents in real time: they equal its own. Consequently, all agents share the same confidence interval radius for a given arm.
Hence, the worst case can be avoided and the algorithm obtains a near-optimal result.

For the second challenge, one intuition is to design an online estimation algorithm that accounts for the connectivity of the communication network. We provide an estimation policy called the consensus estimation subroutine (\texttt{CES}) in Section~\ref{Section: distributed estimation}.
In \texttt{CES}, each agent combines its neighbors' global estimates with its own latest sample in a dynamic proportion. The neighbors' global estimates already contain information about agents that are not directly connected. Hence, the policy can counter the information congestion caused by the incomplete communication graph.
As rounds proceed, the latest global estimate gradually sheds the effects of heterogeneity.

Based on these two ideas, the exploration efficiency increases by a factor of $N$, since each agent can fully utilize the exploration of all agents, and the influence of heterogeneous feedback is reduced.

\subsection{Distributed Round-Robin-Based Bandit Algorithm (\texttt{DRRB-bandit})}\label{section: distributed successive elimination}

We present a federated bandit learning algorithm called \texttt{DRRB-bandit}, which employs round-robin sampling as the underlying arm-pulling policy. A key idea behind \texttt{DRRB-bandit} is that agents uniformly sample arms to track the global mean of each arm.
With \texttt{DRRB-bandit}, agents select arms through round-robin sampling and eliminate suboptimal arms by comparing the upper confidence bound of each candidate arm with the lower confidence bound of the empirically best arm. By attaching time labels to the suboptimal arms, the algorithm ensures that all agents avoid asynchronous elimination, which is typically caused by time delays in a fully distributed communication graph.

To ensure synchronous sampling, the algorithm maintains a candidate arm set, containing arms to be explored in a round-robin manner. The candidate arm set is initialized as the arm set \(\mathcal K\). As the sample count increases, the algorithm gradually identifies suboptimal arms and removes them from the candidate arm set until only one remains. 
When an agent identifies a suboptimal arm, it notifies its neighbors. To ensure that all agents eliminate a suboptimal arm synchronously, a time label is transmitted along with this arm.
The time label indicates the time slot by which all agents have received the suboptimal arm, accounting for the time delay caused by fully distributed communication.
The design of the time label ensures that all agents update the candidate arm set simultaneously. The pseudocode of \texttt{DRRB-bandit} is summarized in Algorithm~\ref{alg: DAEE}.

\paragraph{Round-Robin Policy for Exploration.}
In successive elimination algorithms, each agent pulls arms from the arm set using round-robin sampling. Each agent $j$ maintains a dynamic candidate arm set \(\mathcal S_j(t)\), which initially includes all arms, i.e., $\mathcal{S}_j(0)=\mathcal{K}$. Over time, this set updates based on the agent's exploration and the information it shares with neighbors.
The bandit algorithm operates on the candidate arm set, with agents using round-robin sampling to learn the local reward distribution and estimate the global mean for each arm. By learning the global means, agents can identify suboptimal arms. In addition, each agent shares information about suboptimal arms with its neighbors. Based on the identified suboptimal arms, agents update their candidate arm set accordingly.
Arms identified as suboptimal are eliminated from the candidate arm set at a predetermined time, as specified in Lines~\ref{DRRB_13} and \ref{DRRB_14} of Algorithm~\ref{alg: DAEE}. The policy for arm elimination will be further explained below.

\paragraph{Arm Elimination.}
To manage the candidate arm set based on information from other agents, we introduce an \textit{elimination arm set}, denoted as $\mathcal{B}_j$, which stores the indices of arms identified as suboptimal and slated for elimination.
At the beginning of each round, the algorithm selects the arm with the highest global estimate, \(\tilde \mu_{i^{\max}, j}\), as the benchmark. It then compares the global estimates of the arms in the candidate set to this maximum value. If an arm's global estimate is lower than the benchmark by more than twice the radius of its confidence interval, that arm is considered suboptimal and is added to the elimination arm set $\mathcal{B}_j$ (Line~\ref{DRRB_8}, Algorithm~\ref{alg: DAEE}).

In a fully distributed communication graph, each agent has distinct capabilities for collecting and processing information. To manage the updates of the candidate sets across all agents, a time label $t_i$ is assigned to each suboptimal arm $i$. This label incorporates both the communication delay inherent in the distributed system and the time at which the arm is identified as suboptimal.
Using the predetermined time label $t_i$, agents can synchronize the elimination of the suboptimal arm, ensuring that all agents remove it from their candidate sets at the same time.

At each time slot $t$, agent $j$ samples all arms in the candidate set $\mathcal{S}_j(t)$.
Let $\tau_{i,j}(t)$ denote the number of samples of arm $i$ by agent $j$ up to time slot $t$. Since all agents update their candidate set synchronously, the number of observations of all agents on arm $i$ is equal. Thus, the total sample count for arm $i$ is $\tau_{i}(t)=N\tau_{i,j}(t)$.  
Let $\tilde{\mu}_{i,j}(t)$ denote the estimate of the global mean of arm $i$ by agent $j$ (a detailed explanation is given with \texttt{CES}, Algorithm~\ref{alg: DE}). Based on the global estimate $\tilde{\mu}_{i,j}$ and the sample count $\tau_{i,j}$, we can construct a confidence interval for the global reward mean $\mu_i$, which follows from Hoeffding's inequality \citep{hoeffding1994probability}.
Define $U_{i,j}(t,\delta)$ as the radius of the confidence interval for the reward process with $\tau_{i,j}(t)$ samples and confidence level $1-2\delta$, given by
\begin{equation}\label{CI}
U_{i,j}(t,\delta)\coloneqq\sqrt{\frac{\log\delta^{-1}}{2N\tau_{i,j}(t)}}+\frac{Q}{(1-\lambda_2)(\tau_{i,j}(t)+1)},
\end{equation}
where $\delta$ controls the probability that the true mean lies outside the above confidence interval (the details and analysis are given in Lemma~\ref{lemma: distributed estimation}). The global reward mean $\mu_i$ is contained in the confidence interval $(\tilde{\mu}_{i,j}(t)-U_{i,j}(t,\delta),\tilde{\mu}_{i,j}(t)+U_{i,j}(t,\delta))$ with probability at least $1-2\delta$. For simplicity, we write $\texttt{UCB}_{i,j}$ and $\texttt{LCB}_{i,j}$ for the upper and lower endpoints of this interval, i.e., the upper and lower confidence bounds of $\mu_i$, respectively.
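
As a rough numerical illustration of \eqref{CI}, take the hypothetical values $N=10$, $\tau_{i,j}(t)=50$, $\delta=0.01$, $Q=1$ and $\lambda_2=0.5$ (chosen only for this example and not taken from the analysis). With the natural logarithm,
\[
U_{i,j}(t,\delta)=\sqrt{\frac{\log 100}{2\cdot 10\cdot 50}}+\frac{1}{(1-0.5)\cdot 51}\approx 0.068+0.039=0.107 .
\]
For comparison, the Hoeffding radius built from the $50$ samples of a single agent alone would be $\sqrt{\log 100/(2\cdot 50)}\approx 0.215$. Under these assumed values, pooling the samples of all $N$ agents shrinks the first term by a factor of $\sqrt{N}$, while the second term decays at the faster rate $1/\tau_{i,j}(t)$.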

\begin{algorithm}[tb]
    \caption{Distributed Round-Robin-based Bandit Algorithm (\texttt{DRRB-bandit}) (for agent $j$)}\label{alg: DAEE}
    \textbf{Input}: The time horizon $T$, the diameter $D$ and the arm set $\mathcal{K}$ \\
    \textbf{Initialization}: $t=0$, $\tau_{i,j}=0$, $U_{i,j}=1$, $\mathcal{S}_j=\mathcal{K}$, $\mathcal{B}_j=\varnothing$\\
    \begin{algorithmic}[1] %[1] enables line numbers
        \STATE Pull each arm one time and receive a local reward $X_{i,j}$ \label{DRRB_1}
        \STATE $\tilde{\mu}_{i,j}\gets X_{i,j}$, $\tau_{i,j}\gets \tau_{i,j}+1$ for all $i\in\mathcal{K}$; $t\gets t+K$ \label{DRRB_2}
        \WHILE{$t\leq T$} \label{DRRB_3}
        % \STATE $\tilde{\mu}_{i^{\max},j}\gets\max\{\tilde{\mu}_{i,j}: i\in \mathcal{S}_j\}$ 
        \STATE $i^{\max}\gets \arg\max_i\{\tilde{\mu}_{i,j}: i\in \mathcal{S}_j\}$ \label{DRRB_4}
        \FOR{$i \in \mathcal{S}_j$}  \label{DRRB_5}
        \STATE Pull arm $i$ and obtain the reward $X_{i,j}$ \label{DRRB_6}
        \STATE $t\gets t+1$, $\tau_{i,j} \gets {\tau}_{i,j} + 1$
        \IF{$\tilde{\mu}_{i,j} < \tilde{\mu}_{i^{\max} ,j}-2U_{i,j}$}  \label{DRRB_8}
        \STATE $t_i\gets t+\lvert\mathcal{S}_j\rvert D$, $\mathcal{B}_j \gets \mathcal{B}_j \cup \{i\} $  \label{DRRB_9}
        \ENDIF
        \STATE Update $U_{i,j}$ via equation \eqref{CI} \label{DRRB_7}
        \ENDFOR
        \STATE Run Subroutine~\ref{alg: DE} to obtain the latest global estimates \label{DRRB_11}
        \FOR{each arm $i$ in $\mathcal{B}_{j}$ whose $t\geq t_i$} \label{DRRB_13}
        \STATE \textbf{if} $\lvert\mathcal{S}_j\rvert>1$ \textbf{then} $\mathcal{S}_j\gets \mathcal{S}_j\backslash\{i\}$; \textbf{else} $\mathcal{S}_j\gets \mathcal{S}_j$\label{DRRB_14} 
        \ENDFOR
        \ENDWHILE
    \end{algorithmic}
\end{algorithm}

An arm $i$ in the candidate set $\mathcal{S}_j(t)$ is considered suboptimal if the global estimates of arms $i$ and $i^{\max}$ satisfy
\begin{equation}\label{criterion}
\begin{split}
    \underbrace{\tilde{\mu}_{i,j}(t)+U_{i,j}(t,\delta)}_{\texttt{UCB}_{i,j}}< \underbrace{\tilde{\mu}_{i^{\max},j}(t)-U_{i,j}(t,\delta)}_{\texttt{LCB}_{i^{\max},j}},
\end{split}
\end{equation}
where $i^{\max}$ is the arm with the maximum global mean estimate among all arms in $\mathcal{S}_j(t)$ and is determined at the beginning of each round (Line~\ref{DRRB_4}, Algorithm~\ref{alg: DAEE}). This condition is equivalent to the test in Line~\ref{DRRB_8} of Algorithm~\ref{alg: DAEE}; the same radius appears on both sides because all candidate arms share the same sample count.
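
To make the test in Line~\ref{DRRB_8} of Algorithm~\ref{alg: DAEE} concrete, consider the hypothetical values $\tilde{\mu}_{i^{\max},j}=0.80$ and $U_{i,j}=0.10$ (illustrative numbers only). The elimination threshold is then
\[
\tilde{\mu}_{i^{\max},j}-2U_{i,j}=0.80-0.20=0.60 .
\]
An arm with $\tilde{\mu}_{i,j}=0.55$ satisfies $0.55<0.60$ and is placed in $\mathcal{B}_j$; equivalently, its upper confidence bound $0.55+0.10=0.65$ lies below the lower confidence bound $0.80-0.10=0.70$ of arm $i^{\max}$. An arm with $\tilde{\mu}_{i,j}=0.65$ fails the test, since $0.65\geq 0.60$: the two confidence intervals still overlap ($0.75\geq 0.70$), so the arm remains in the candidate set.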


In \texttt{DRRB-bandit}, once agent \(j\) identifies arm $i$ as suboptimal, it adds the arm to the elimination arm set \(\mathcal{B}_j\) and sends its index to its neighbors, which relay it through the network.
Each such arm $i$ is assigned a predetermined time label $t_i$, which indicates the time at which the arm will be removed.
The value of $t_i$ is set according to the following equation:
\begin{equation}\label{time-label}
    t_i=t+\lvert\mathcal{S}_j(t)\rvert D,
\end{equation}
where $D$ is the diameter of the communication graph $\mathcal{G}$, $t$ is the time slot at which the arm is identified as suboptimal, and $\lvert\mathcal{S}_j(t)\rvert$ is the number of arms in $\mathcal{S}_j(t)$. Given the indices of the suboptimal arms, the elimination arm set $\mathcal{B}_j(t)$ of agent $j$ at time $t$ is constructed, which contains all arms identified as suboptimal, i.e.,
\begin{equation}\label{elimination set}
\begin{split}
    \mathcal{B}_j(t)=\{&i\in\mathcal{S}_j(t):\exists\, i^{\prime}\in\mathcal{S}_j(t)\setminus\{i\}~~\text{such that}\\
    &\qquad\tilde{\mu}_{i,j}(t)< \tilde{\mu}_{i^{\prime},j}(t)-2U_{i,j}(t,\delta)\}.
\end{split}
\end{equation}
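
As a small worked instance of \eqref{time-label}, suppose (hypothetical values) that agent $j$ identifies arm $i$ as suboptimal at time slot $t=120$, that its candidate set currently holds $\lvert\mathcal{S}_j(t)\rvert=4$ arms, and that the graph diameter is $D=3$. Then
\[
t_i=120+4\cdot 3=132 .
\]
One way to read this choice, inferred from Algorithm~\ref{alg: DAEE} rather than stated as a formal guarantee here: agents communicate once per round, a round lasts $\lvert\mathcal{S}_j(t)\rvert$ time slots, and a message needs at most $D$ communication steps to reach every agent, so $\lvert\mathcal{S}_j(t)\rvert D$ time slots suffice for the index of arm $i$ to reach all agents before the removal takes effect.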

By continuously monitoring the elimination set, the algorithm iteratively updates the candidate set until the optimal arm is identified.

\subsection{Consensus Estimation Subroutine (\texttt{CES})}\label{Section: distributed estimation}

To mitigate the biased estimation arising from the heterogeneous setting, we propose a novel consensus estimation subroutine for the federated bandit setting with fully distributed communication, which can be integrated into \texttt{DRRB-bandit} (introduced in Section~\ref{section: distributed successive elimination}).

The key idea of \texttt{CES} is to synthesize the information exchanged within each agent's neighborhood and to estimate the global mean without bias. In this section, we propose a \emph{fair} mechanism in which the samples of all agents are used equally to estimate the global mean.
By properly configuring \texttt{CES}, each agent obtains a fair global estimate, which allows suboptimal arms to be identified more accurately and rapidly.

\begin{algorithm}[tb]
    \caption{Consensus Estimation Subroutine (\texttt{CES}) (for agent $j$)}\label{alg: DE}
    \textbf{Input}: The local reward $X_{i,j}$, the candidate arm set $\mathcal{S}_{j}$, the function of weight coefficient $\sigma_i(\tau)=\frac{1}{\tau+1}$, the sample count $\tau_{i,j}$ and the weight matrix $W=[\omega_{j,j^{\prime}}]_{N\times N}$\\
    \textbf{Output}: The latest estimate $\tilde{\mu}_{i,j}$ and elimination arm set $\mathcal{B}_j$ \\
    \begin{algorithmic}[1] 
        \STATE Send $\tilde{\mu}_{i,j}$ for $i\in\mathcal{S}_j$, the set $\mathcal{B}_j$, and the associated time labels $t_i$ to neighbors \label{CES_2}
        \STATE Receive $\tilde{\mu}_{i,j^{\prime}}$, $t_i$ and $\mathcal{B}_{j^{\prime}}$ from neighbors $j^{\prime}\in \mathcal{N}_j$ \label{CES_3}
        \FOR{$i\in\mathcal{S}_j$}
        \STATE Update the weight coefficient $\sigma_i$ and compute the latest global estimate as follows
        $$\tilde{\mu}_{i,j}\gets(1-\sigma_i)\sum_{j^{\prime}\in\mathcal{N}_j\cup \{j\}}\omega_{j,j^{\prime}}\tilde{\mu}_{i,j^{\prime}}+\sigma_i X_{i,j}$$
        \ENDFOR
        \FOR{$j^{\prime}\in\mathcal{N}_j$}\label{CES_9}
        \STATE Update the elimination arm set via $\mathcal{B}_j\gets \mathcal{B}_j \cup \mathcal{B}_{j^{\prime}}$ \label{CES_10}
        \ENDFOR
    \end{algorithmic}
\end{algorithm}

In \texttt{CES}, agent $j$ combines the historical data from its neighborhood with its own real-time reward to obtain global estimates whose bias diminishes over time.
For concreteness, we describe the consensus process for estimating the global mean of a single arm $i$.
Up to time slot $t$, agent $j$ has sampled arm $i$ $\tau_{i,j}(t)$ times, and the reward obtained at the $\tau_{i,j}(t)$-th sample is denoted by $X_{i,j}^{\tau_{i,j}(t)}$. The corresponding global estimate of agent $j$ on arm $i$ is denoted by $\tilde{\mu}_{i,j}^{\tau_{i,j}(t)}$.
In the communication phase, agent $j$ exchanges its previous estimate $\tilde{\mu}_{i,j}^{\tau_{i,j}(t)-1}$ with its neighborhood $\mathcal{N}_j$ (Lines~\ref{CES_2}--\ref{CES_3}, Algorithm~\ref{alg: DE}). Based on the historical observations from the neighborhood and the real-time reward $X_{i,j}^{\tau_{i,j}(t)}$, each agent $j$ updates its latest global estimate $\tilde{\mu}_{i,j}^{\tau_{i,j}(t)}$ as follows
\begin{equation}\label{iteration}
\begin{split}
    \tilde{\mu}_{i,j}^{\tau_{i,j}(t)}\coloneqq&(1-\sigma_i(\tau_{i,j}(t)))\sum_{j^{\prime}\in\mathcal{N}_j\cup \{j\}}\omega_{j,j^{\prime}}\tilde{\mu}_{i,j^{\prime}}^{\tau_{i,j^{\prime}}(t)-1}\\
    &+\sigma_i(\tau_{i,j}(t))X_{i,j}^{\tau_{i,j}(t)},
\end{split}
\end{equation}
where $\sigma_i(\tau_{i,j}(t))$ is the weight coefficient that adjusts the contribution of each piece of information to the global estimate $\tilde{\mu}_{i,j}^{\tau_{i,j}(t)}$. The elimination arm set $\mathcal{B}_j(t)$ is also updated in \texttt{CES} (Lines~\ref{CES_9}--\ref{CES_10}, Algorithm~\ref{alg: DE}). Algorithm~\ref{alg: DE} returns the latest global estimates and the updated elimination arm set to \texttt{DRRB-bandit}.
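
A single step of \eqref{iteration} can be traced with hypothetical numbers. Assume agent $j$ has one neighbor $j^{\prime}$, weights $\omega_{j,j}=\omega_{j,j^{\prime}}=1/2$ (an illustrative choice whose entries sum to one), previous estimates $\tilde{\mu}_{i,j}=0.6$ and $\tilde{\mu}_{i,j^{\prime}}=0.4$, a new local reward $X_{i,j}=1$, and $\tau_{i,j}(t)=4$, so that $\sigma_i(4)=1/5$ under the weight function $\sigma_i(\tau)=\frac{1}{\tau+1}$ of Algorithm~\ref{alg: DE}. Then
\[
\tilde{\mu}_{i,j}\gets\Bigl(1-\tfrac{1}{5}\Bigr)\Bigl(\tfrac{1}{2}\cdot 0.6+\tfrac{1}{2}\cdot 0.4\Bigr)+\tfrac{1}{5}\cdot 1=0.4+0.2=0.6 .
\]
In this sketch the consensus term pulls the estimate toward the neighborhood average $0.5$, while the new sample contributes with weight $1/(\tau_{i,j}(t)+1)$, which shrinks as more samples are collected.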
