\section{Regret Analysis}\label{Section: analysis}
In this section, we summarize the theoretical results for the FMAB problem and present near-optimal bounds on both individual and group regret.
To derive these regret bounds, we first introduce a lemma that gives a tighter confidence interval for the estimates than previous work. We then state the main results for FMAB as theorems.

The following lemma shows that \texttt{CES} achieves a bounded estimation error from limited samples, and that the upper bound on this error decreases as the number of agents $N$ grows.
\begin{lemma}\label{lemma: distributed estimation}
Assume that $X_{i,j}$ is an i.i.d. reward process with unknown mean $\mu_{i,j}$. 
Set $\sigma_i(\tau_{i,j}(t))=\frac{1}{\tau_{i,j}(t)+1}$.
Then, for any arm $i\in\mathcal{K}$, agent $j\in\mathcal{N}$ and time slot $t\in \{1,\dots,T\}$, with probability $1-2\delta$, $\delta \in (0,0.5)$, we have 
\begin{equation*}
  \lvert\tilde{\mu}_{i,j}^{\tau_{i,j}(t)}-\mu_i\rvert\leq \sqrt{\frac{\log\delta^{-1}}{2N\tau_{i,j}(t)}}+\frac{Q}{(1+\tau_{i,j}(t))(1-\lambda_2)},  
\end{equation*}
where $Q$ is determined by the communication graph $\mathcal{G}$ and $\lambda_2$ is the second largest eigenvalue of the matrix $W$. We have $Q=1$ if the graph $\mathcal{G}$ is balanced; otherwise, $Q=\sqrt{N}$.
\end{lemma}

\paragraph{Proof Sketch of Lemma~\ref{lemma: distributed estimation}.}
The term $\lvert\tilde{\mu}_{i,j}^{\tau_{i,j}(t)}-\mu_i\rvert$ can be upper bounded by $\lvert\hat{\mu}_{i}^{\tau_{i,j}(t)}-\mu_i\rvert+\lvert\tilde{\mu}_{i,j}^{\tau_{i,j}(t)}-\hat{\mu}_i^{\tau_{i,j}(t)}\rvert$ via the triangle inequality.
Here $\hat{\mu}_i^{\tau_{i,j}(t)}$ denotes the global estimate under full-information communication, i.e., when the sampled rewards of arm $i$ from all agents are accessible.
The first term $\lvert\hat{\mu}_{i}^{\tau_{i,j}(t)}-\mu_i\rvert$ is bounded by Hoeffding's inequality (Lemma~\ref{lemma2}).
The second term is bounded using graph-theoretic properties (Lemma~\ref{lemma1}):
iterating equation~\eqref{iteration} yields the relation between $\tilde{\mu}_{i,j}^{\tau_{i,j}(t)}$ and $X_{i,j}^{\tau_{i,j}(t)}$, where each part matches the corresponding part of $\hat{\mu}_i^{\tau_{i,j}(t)}$.
Detailed proofs are provided in Appendix~\ref{appendix: proof of lemma 1}. 

Lemma~\ref{lemma: distributed estimation} provides a tighter confidence interval for the global estimates $\tilde{\mu}_{i,j}$ than previous works. The radius of the confidence interval is $N^2$ and $N$ times smaller than those in~\citet{zhu2021federated} and \citet{xu2024decentralized}, respectively. While \citet{zhu2023distributed} achieved similar performance, their results only hold for fully connected communication graphs, i.e., graphs in which each agent is directly connected to all others.
The improvement stems from equation~\eqref{iteration}, which ensures that each reward $X_{i,j}^{\tau_{i,j}(t)}$ carries weight $\frac{1}{N{\tau_{i,j}(t)}}$ in the global estimate $\tilde{\mu}_{i,j}^{\tau_{i,j}(t)}$, allowing agents to estimate the global mean more accurately.
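
As a purely illustrative instance of Lemma~\ref{lemma: distributed estimation}, take the hypothetical values $N=10$, $\tau_{i,j}(t)=100$, $\delta=0.01$, $Q=1$, and $\lambda_2=0.5$ (chosen for illustration only, with $\log$ taken as the natural logarithm). The two terms of the radius then evaluate to
\[
\sqrt{\frac{\log\delta^{-1}}{2N\tau_{i,j}(t)}}=\sqrt{\frac{\log 100}{2000}}\approx 0.048,
\qquad
\frac{Q}{(1+\tau_{i,j}(t))(1-\lambda_2)}=\frac{1}{101\times 0.5}\approx 0.020,
\]
for a total radius of roughly $0.068$. Setting $N=1$ in the first term with the same $\tau_{i,j}(t)$ and $\delta$ gives $\sqrt{\log 100/200}\approx 0.152$, so in this example the statistical term shrinks by a factor of $\sqrt{N}=\sqrt{10}\approx 3.16$.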

\subsection{Upper bounds}\label{section: upper bound}
\begin{theorem}[Regret upper bound]\label{upper bound analysis}
Let $U_{i,j}(t,\delta)$ in equation \eqref{CI} with $\delta=T^{-2}$ be the radius of the confidence interval of a random $[0,1]$-valued i.i.d. process. Given $\gamma>0$, 
\texttt{DRRB-bandit} for FMAB problems achieves the following performance, with a probability of at least $1-2TKN\delta$.
\begin{enumerate}
    \item[(i)] Individual regret:\\
    \[\begin{split}
        \mathbb{E}[{R_{j}^T}(\mathcal{A})]
    \leq &\underset{i:\Delta_i>0}{\sum}\frac{16\log T}{N\Delta_i}+ \underset{i:\Delta_i>0}{\sum}(D+1)\Delta_i\\
     &+\frac{8Q(K-1)}{1-\lambda_2}+1,
    \end{split}\]
    \item[(ii)] Group regret:\\
    \[
    \begin{split}
        \mathbb{E}[R^T(\mathcal{A})]\leq &\sum_{i:\Delta_i>0}\frac{16\log T}{\Delta_i}+\underset{i:\Delta_i>0}{\sum}N(D+1)\Delta_i\\
     &+\frac{8NQ(K-1)}{1-\lambda_2}+1.
    \end{split}
    \]
\end{enumerate}
\end{theorem}

\paragraph{Proof Sketch of Theorem~\ref{upper bound analysis}.}

To bound the individual regret ${R^T_j}(\mathcal{A})$, we first need to determine an upper bound for the sample counts. Based on equation~\eqref{criterion}, we can derive an instance-dependent upper bound for the sample counts. However, in distributed networks, communication delays between agents cause the upper bound derived from equation \eqref{criterion} to be inaccurate.
To ensure synchronization in sampling, agents will additionally sample suboptimal arms. Therefore, we bound the sample count by considering the diameter of the communication graph and the theoretical result from equation~\eqref{criterion}. Subsequently, by performing a regret decomposition, we combine the regrets for each arm sampled by agent $j$ to obtain the individual regret ${R^T_j}(\mathcal{A})$.
Detailed proofs are provided in Appendix~\ref{appendix: proof of theorem 1}.
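
As a rough consistency check (a sketch, not a substitute for the proof in the appendix), suppose the number of times agent $j$ samples a suboptimal arm $i$ obeys the bound stated in the proof sketch of Theorem~\ref{theorem: communication cost}, namely $\tau_{i,j}\le \frac{16\log T}{N\Delta_i^2}+\frac{8Q}{(1-\lambda_2)\Delta_i}+D+1$. Assuming each such sample contributes regret $\Delta_i$, multiplying by $\Delta_i$ and summing over the at most $K-1$ suboptimal arms gives
\[
\sum_{i:\Delta_i>0}\Delta_i\,\tau_{i,j}
\le \sum_{i:\Delta_i>0}\frac{16\log T}{N\Delta_i}
+\frac{8Q(K-1)}{1-\lambda_2}
+\sum_{i:\Delta_i>0}(D+1)\Delta_i,
\]
which reproduces the first three terms of the individual regret bound in Theorem~\ref{upper bound analysis}; multiplying by $N$ gives the corresponding terms of the group regret bound.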

The bounds in Theorem~\ref{upper bound analysis} contain the graph-dependent term $\frac{1}{1-\lambda_2}$, which the following corollary bounds explicitly in terms of $D$ and $N$.

\begin{corollary}[An extension of Theorem~\ref{upper bound analysis}]\label{corollary1}
Under the conditions of Theorem~\ref{upper bound analysis}, the regret is further bounded as follows.

\begin{enumerate}
    \item[(i)] Individual regret:\\
    \[\begin{split}
        \mathbb{E}[R_{j}^T(\mathcal{A})]
    \leq&\sum_{i:\Delta_i>0}\frac{16\log T}{N\Delta_i}+8KQDN^2\\
     &+\sum_{i:\Delta_i>0}(D+1)\Delta_i+1,
    \end{split}\]
    \item[(ii)] Group regret:\\
    \[
    \begin{split}
    \mathbb{E}[R^T(\mathcal{A})]
    \leq&\sum_{i:\Delta_i>0}\frac{16\log T}{\Delta_i}+8KQDN^3\\
     &+\sum_{i:\Delta_i>0}N(D+1)\Delta_i+1.
    \end{split}
    \]
\end{enumerate}
\end{corollary}

\begin{corollary}[Instance-independent regret bound]\label{corollary2}
Under the conditions of Theorem~\ref{upper bound analysis}, \texttt{DRRB-bandit} achieves the following instance-independent regret bounds for FMAB problems:
\begin{enumerate}
    \item[(i)] Individual regret:
    \[\begin{split}
        \mathbb{E}[R_{j}^T(\mathcal{A})]\leq& 8\sqrt{\frac{KT\log T}{N}} + K(D+1)\\
        &+ \frac{8QK}{1-\lambda_2}+1,
    \end{split}\]
    \item[(ii)] Group regret:
    \[\begin{split}
        \mathbb{E}[R^T(\mathcal{A})]\leq &8\sqrt{KNT\log T} + KN(D+1)\\
        & + \frac{8NQK}{1-\lambda_2}+1.
    \end{split}\]
\end{enumerate}
\end{corollary}
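
One standard route to a bound of this form (sketched here as an assumption about the argument, which may differ from the one in the appendix) splits the suboptimal arms at a threshold $\Delta>0$. Arms with $\Delta_i\le\Delta$ contribute at most $T\Delta$ to the individual regret over $T$ rounds, while arms with $\Delta_i>\Delta$ contribute at most $\frac{16K\log T}{N\Delta}$ through the leading term of Theorem~\ref{upper bound analysis}. Choosing $\Delta=4\sqrt{\frac{K\log T}{NT}}$ balances the two:
\[
T\Delta+\frac{16K\log T}{N\Delta}
=4\sqrt{\frac{KT\log T}{N}}+4\sqrt{\frac{KT\log T}{N}}
=8\sqrt{\frac{KT\log T}{N}}.
\]
The remaining terms follow from $\Delta_i\le 1$ for $[0,1]$-valued rewards, which gives $\sum_{i:\Delta_i>0}(D+1)\Delta_i\le K(D+1)$, and from $K-1\le K$.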

\begin{theorem}[Communication cost]\label{theorem: communication cost}
Under the conditions of Theorem~\ref{upper bound analysis}, \texttt{DRRB-bandit} incurs a communication cost of at most
$$ \mathbb{E}[C^T(\mathcal{A})] \le \frac{16K\log T}{\Delta_{\min}^2} + \frac{8KNQ}{(1-\lambda_2)\Delta_{\min}} +KN( D + 1), $$
where $\Delta_{\min} = \min_{i : \Delta_i > 0} \Delta_i$.
\end{theorem}
\paragraph{Proof Sketch of Theorem~\ref{theorem: communication cost}.} From the proof of Theorem~\ref{upper bound analysis}, one can deduce that agent $j$ samples a suboptimal arm $i$ at most $\frac{8\log\delta^{-1}}{N\Delta_i^2} + \frac{8Q}{(1-\lambda_2)\Delta_i} + D + 1$ times. Substituting the violation probability $\delta = \frac{1}{T^2}$ from Theorem~\ref{upper bound analysis}, so that $\log\delta^{-1}=2\log T$, the sample count is bounded by
$$ \tau_{i,j} \le \frac{16\log T}{N\Delta_i^2} + \frac{8Q}{(1-\lambda_2)\Delta_i} + D + 1. $$

In each round, \texttt{DRRB-bandit} collects information about all arms and communicates it to other agents in a single batch. Therefore, to determine the maximum number of communications, it suffices to consider the number of samples of the arm that remains in the candidate set for the second longest period.
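
As an arithmetic observation offered for illustration (the precise accounting is left to the full proof), the expression in Theorem~\ref{theorem: communication cost} coincides with $KN$ times the per-agent, per-arm sample-count bound above, evaluated at $\Delta_i=\Delta_{\min}$:
\[
KN\left(\frac{16\log T}{N\Delta_{\min}^2}+\frac{8Q}{(1-\lambda_2)\Delta_{\min}}+D+1\right)
=\frac{16K\log T}{\Delta_{\min}^2}+\frac{8KNQ}{(1-\lambda_2)\Delta_{\min}}+KN(D+1).
\]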

\begin{remark}
Although the proposed algorithm requires knowledge of the time horizon $T$ to set $\delta$ and achieve near-optimal regret, in practice, when $T$ is unknown, the algorithm can still perform similarly if $\delta$ is chosen as a time-varying function, such as $\delta=1/t^2$.
\end{remark}


\subsection{Lower bounds}\label{section: lower bound}
Besides the upper bounds, we also present lower bounds on both individual and group regret for FMAB problems. We derive two separate lower bounds corresponding to two distinct cases (Theorems~\ref{lower bound analysis} and \ref{lower bound analysis2}).

\begin{theorem}[General regret lower bound]\label{lower bound analysis}

For FMAB problems with any number of agents, arms, and stochastic rewards satisfying a $1$-Gaussian distribution, if the graph $\mathcal{G}$ is connected, any federated bandit algorithm must incur regret of at least:
\begin{enumerate}
    \item[(i)] Individual regret:
    $$\liminf_{T\to\infty} \frac{\mathbb{E}[{R}_j^T(\mathcal{A})]}{\log T} \ge \sum_{i:\Delta_i>0} \frac{2}{N^2\Delta_i}.$$
    \item[(ii)] Group regret:
    $$\liminf_{T\to\infty} \frac{\mathbb{E}[{R}^T(\mathcal{A})]}{\log T} \ge \sum_{i:\Delta_i>0} \frac{2}{N\Delta_i}.$$
\end{enumerate}
\end{theorem}

\begin{theorem}[Regret lower bound for algorithms with round-robin sampling]\label{lower bound analysis2}
For FMAB problems with any number of agents, arms, and stochastic rewards satisfying a $1$-Gaussian distribution, if the graph $\mathcal{G}$ is connected, any federated bandit algorithm using round-robin sampling must incur regret of at least:
\begin{enumerate}
    \item[(i)] Individual regret:
    $$\liminf_{T\to\infty} \frac{\mathbb{E}[{R}_j^T(\mathcal{A})]}{\log T} \ge \sum_{i:\Delta_i>0} \frac{2}{N\Delta_i}.$$
    \item[(ii)] Group regret:
    $$\liminf_{T\to\infty} \frac{\mathbb{E}[{R}^T(\mathcal{A})]}{\log T} \ge \sum_{i:\Delta_i>0} \frac{2}{\Delta_i}.$$
\end{enumerate}
\end{theorem}

\begin{remark}
In Section~\ref{section: lower bound}, we present two types of lower bounds for FMAB problems: one general bound in Theorem~\ref{lower bound analysis}, and a specific bound for the class of round-robin-based algorithms in Theorem~\ref{lower bound analysis2}.
Theorem~\ref{lower bound analysis} provides general lower bounds for individual regret $\Omega(\sum_{i:\Delta_i>0}N^{-2}\Delta_i^{-1}\log T)$ and group regret $\Omega(\sum_{i:\Delta_i>0}N^{-1}\Delta_i^{-1}\log T)$ under the strict assumption that all other agents' reward means are equal. This implies that each agent only needs to learn its local reward means.
Theorem~\ref{lower bound analysis2} gives the individual regret bound $\Omega(\sum_{i:\Delta_i>0}N^{-1}\Delta_i^{-1}\log T)$ and group regret bound $\Omega(\sum_{i:\Delta_i>0}\Delta_i^{-1}\log T)$ for all round-robin-based algorithms.
\end{remark}

\begin{remark}
According to Theorems~\ref{upper bound analysis} and \ref{lower bound analysis2}, the lower and upper bounds match in terms of the number of agents $N$, the reward gap $\Delta_i$, and the time horizon $T$ for the class of algorithms based on round-robin sampling. However, for general algorithms, we have been unable to prove the optimality of \texttt{DRRB-bandit} due to the complexity of decision-making in multi-agent systems. Compared with the algorithms in previous works, we improve the individual and group regret bounds to $O(\sum_{i:\Delta_i>0}N^{-1}\Delta_i^{-1}\log T)$ and $O(\sum_{i:\Delta_i>0}\Delta_i^{-1}\log T)$, respectively.
\end{remark}
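
To make the comparison concrete, consider only the leading $\log T$ terms and set aside the different reward assumptions of the two results ($[0,1]$-valued rewards for the upper bound, Gaussian rewards for the lower bounds). For each suboptimal arm $i$, the individual regret coefficients in Theorem~\ref{upper bound analysis} and Theorem~\ref{lower bound analysis2} satisfy
\[
\frac{16/(N\Delta_i)}{2/(N\Delta_i)}=8,
\]
and the same ratio $\frac{16/\Delta_i}{2/\Delta_i}=8$ holds for the group regret. Against the general lower bound of Theorem~\ref{lower bound analysis}, the corresponding ratio is $\frac{16/(N\Delta_i)}{2/(N^2\Delta_i)}=8N$ for both individual and group regret, which is where the gap in $N$ for general algorithms appears.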


\begin{remark}
In the special case of homogeneous FMAB problems ($\mu_{i,j}=\mu_i$ for all agents), the regret upper bounds of Theorem~\ref{upper bound analysis} match the known individual and group lower bounds, $\Omega(\sum_{i:\Delta_i>0}N^{-1}\Delta_i^{-1}\log T)$ and $\Omega(\sum_{i:\Delta_i>0}\Delta_i^{-1}\log T)$~\citep{wang2020optimal,wang2023achieving}.
Therefore, our algorithm, when reduced to the easier homogeneous setting, is near-optimal.
\end{remark}


