\section{Problem Description}\label{section: problem formulation}
In the MAB problem, a player repeatedly selects an arm from a given set $\mathcal{K}=\{1,\dots, K\}$ over time. At each time slot $t\in \{1,\dots, T\}$, the player chooses an arm to pull and obtains a reward associated with the selected arm.
The rewards of each arm are drawn from an independent and identically distributed (i.i.d.) process, with values in the interval $[0,1]$\footnote{Via Lemma~\ref{lemma5}, the results in this work also hold for other distributions, such as sub-Gaussian distributions.}. This reward serves as real-time feedback to the player regarding the chosen arm.

In this article, we focus on federated bandit problems. Unlike the general MAB problem, this setting introduces two additional elements: multiple players and heterogeneous feedback.
Specifically, we consider a stochastic bandit setting containing $N$ agents, represented by the agent set $\mathcal{N}$.
At each time slot $t$, agent $j$ selects an arm $A_j(t)\in\mathcal{K}$ to pull and receives a random reward $X_{A_j(t),j}(t)$. The decision-making strategy primarily depends on the agents’ past actions and observed rewards.

In this scenario, agent $j$ can only observe its local random reward $X_{i,j}(t)$, which consists of a mean component and a noise component. The reward \( X_{i,j}(t) \) follows an i.i.d. process with mean \( \mu_{i,j} \) and is bounded within \([0,1]\).
If agent \( j \) selects arm \( i \) at time slot \( t \), i.e., \( A_j(t) = i \), the global reward at time \( t \) is defined as $X_{A_j(t)}(t)=X_i(t):=\frac{1}{N}\sum_{j=1}^N X_{i,j}(t)$. Similarly, the global mean of $X_{i}(t)$ is given by $\mu_i:=\frac{1}{N}\sum_{j=1}^N\mu_{i,j}$.
Let \( i^{\star} \) denote the optimal arm, which we assume to be unique, i.e., the arm with the highest global mean among all arms in the set \( \mathcal{K} \): $i^{\star} = \arg\max_{i\in\mathcal{K}} \mu_i$. The reward gap between any arm \( i \) and the optimal arm~\( i^{\star} \) is then defined as $\Delta_i=\mu_{i^{\star}}-\mu_i$.
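As a purely illustrative example (the numbers are hypothetical and are not taken from any experiment in this work), consider $N=2$ agents and $K=2$ arms with local means $\mu_{1,1}=0.9$, $\mu_{1,2}=0.3$, $\mu_{2,1}=0.2$, and $\mu_{2,2}=0.8$. Then
\[
\mu_1=\tfrac{1}{2}(0.9+0.3)=0.6,\qquad \mu_2=\tfrac{1}{2}(0.2+0.8)=0.5,
\]
so that $i^{\star}=1$ and $\Delta_2=\mu_1-\mu_2=0.1$. In this example, agent $2$ would prefer arm $2$ if it relied on its local means alone ($0.8>0.3$), which suggests why the globally optimal arm cannot, in general, be identified from local observations only.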

After sampling, agents exchange information with their neighbors.
The neighborhood of agent \( j \) is denoted by \( \mathcal{N}_j \) and consists of all agents connected to agent \( j \), excluding \( j \) itself. We use a communication matrix $\bm{W}=[\omega_{a,b}]_{N\times N}$ to describe the connectivity structure of the multi-agent system. Further details about this matrix are provided in Appendix~\ref{appendix: multi-agent system}.
We assume that there are no collisions; that is, when multiple agents pull the same arm, each agent independently receives a reward sample drawn from the arm's reward distribution. 
Note that the problem is set in a heterogeneous environment, meaning that the expected reward of arm \( i \) may vary across agents; that is, in general, $\mu_{i,j_1}\neq\mu_{i,j_2}$ for $j_1\neq j_2$.

\textbf{Group regret: }In this paper, group regret is defined as the cumulative reward loss incurred by all agents when they select suboptimal arms instead of the optimal arm. This metric serves as the primary measure for evaluating federated bandit algorithms. The optimal strategy for all agents is to pull the optimal arm at every time slot of the horizon $T$. Therefore, for a distributed algorithm $\mathcal{A}$, the expected group regret of the system is defined as follows:
\begin{equation}\label{group regret}
\mathbb{E}[{R^T}(\mathcal{A})]\coloneqq NT\mu_{i^{\star}}-\sum^T_{t=1}\sum_{j=1}^N\mathbb{E}[X_{A_j(t),j}(t)].
\end{equation}
\textbf{Individual regret: }While group regret is a key performance metric for distributed algorithms, minimizing individual regret is also a crucial challenge in the federated bandit problem.
In a heterogeneous setting, agents may pull the same arm but receive different local rewards, leading to variations in their global estimates. Moreover, an agent’s ability to access information depends on the structure of its neighborhood, resulting in disparities in decision-making.
Therefore, considering individual regret is essential to prevent overly aggressive behavior from any single agent. In practical applications, optimizing individual regret becomes even more critical. For instance, in cooperative systems such as drone swarms, the failure of a single agent can significantly degrade overall performance, a phenomenon known as the ``cask effect.''
To quantify the impact of individual decision-making, individual regret is defined as follows:
\begin{equation}\label{individual regret}
\mathbb{E}[{R_{j}^T}(\mathcal{A})]\coloneqq T\mu_{i^{\star}}-\sum^T_{t=1}\mathbb{E}[X_{A_j(t)}(t)].
\end{equation}
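As a sketch of how this definition relates to the reward gaps, assume (for illustration only) that $A_j(t)$ is determined by information available before time slot $t$, so that it is independent of the rewards generated at time $t$. Then $\mathbb{E}[X_{A_j(t)}(t)]=\mathbb{E}[\mu_{A_j(t)}]$. Letting $n_{i,j}(T)$ denote the number of time slots $t\le T$ in which $A_j(t)=i$ (a notation introduced here only for this sketch), the individual regret can be rewritten as
\begin{align*}
\mathbb{E}[{R_{j}^T}(\mathcal{A})]
&=\sum_{t=1}^T\bigl(\mu_{i^{\star}}-\mathbb{E}[\mu_{A_j(t)}]\bigr)
=\sum_{t=1}^T\mathbb{E}[\Delta_{A_j(t)}]\\
&=\sum_{i\in\mathcal{K}}\Delta_i\,\mathbb{E}[n_{i,j}(T)].
\end{align*}
Under this assumption, bounding the individual regret of agent $j$ reduces to bounding the expected number of times it pulls each suboptimal arm.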

\textbf{Communication cost: }In our setting, we do not impose any restrictions on the type or size of messages exchanged during each communication round. Each time an agent sends messages to its neighbors, a unit cost is incurred. For algorithm $\mathcal{A}$, the communication cost of the global system is defined as
\begin{equation}\label{communication cost}
\begin{split}
    \mathbb{E}[C^T(\mathcal{A})]&\coloneqq\mathbb{E}\Biggl[\sum_{t=1}^T\sum^N_{j=1}\mathbb{I}\{\mathcal{I}_{j}(t)\}\Biggr],
\end{split}
\end{equation}
where $\mathbb{I}\{\cdot\}$ is the indicator function and $\mathcal{I}_{j}(t)$ denotes the event that agent $j$ sends messages to its neighbors at time slot $t$.
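Since each indicator takes values in $\{0,1\}$, the definition above immediately gives
\[
0\le\mathbb{E}[C^T(\mathcal{A})]\le NT,
\]
with the upper bound attained when every agent sends messages at every time slot. As a hypothetical illustration (not a scheme proposed in this work), if every agent sent messages only at time slots that are multiples of some period $\tau\in\{1,\dots,T\}$, the cost would be $N\lfloor T/\tau\rfloor$; for example, $N=10$, $T=1000$, and $\tau=10$ would give a cost of $1000$, compared with $NT=10000$ under communication at every slot.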

