\section{Experiments}\label{section: experiment}
In this section, we conduct a series of numerical experiments to evaluate the performance of \texttt{DRRB-bandit}. All experiments are repeated for $50$ trials, with the means plotted as lines and the standard deviations represented by shaded regions.
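
As one concrete reading of this plotting convention, let $R^{(r)}(t)$ denote the regret at round $t$ in trial $r$ and let $M=50$ be the number of trials. This notation is introduced here only for illustration, and both the $1/(M-1)$ normalization and the one-standard-deviation band are assumptions, since the text does not state which estimator is used. Under these assumptions, the plotted line and the half-width of the shaded region at round $t$ would be
\[
\bar{R}(t)=\frac{1}{M}\sum_{r=1}^{M}R^{(r)}(t),
\qquad
s(t)=\sqrt{\frac{1}{M-1}\sum_{r=1}^{M}\bigl(R^{(r)}(t)-\bar{R}(t)\bigr)^{2}},
\]
so that the shaded band covers $[\bar{R}(t)-s(t),\,\bar{R}(t)+s(t)]$.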

\paragraph{Setups and Baselines.} 
In \texttt{DRRB-bandit}, we set $N=8$, $K=10$, $Q=\sqrt{N}$, and $\delta=1/T^2$. To ensure a fair comparison, we use a ring graph, which is connected and has second-largest eigenvalue $\lambda_2=0.5713$; each agent exchanges information with its four neighbors. We compare both the individual and group regrets of \texttt{DRRB-bandit} against two baselines, \texttt{Gossip-UCB} \cite{zhu2021federated} and \texttt{Dis-UCB} \cite{zhu2023distributed}, as outlined in Table~\ref{tab: compare-to-prior-literature}.
Both baselines tackle the federated bandit problem within the UCB framework but do not fully exploit the advantages of distributed learning. Consequently, the individual regret of \texttt{Dis-UCB} is independent of the number of agents $N$, so the group regret grows linearly with $N$, highlighting a fundamental limitation of their distributed learning design. This shortcoming arises because these algorithms do not effectively use the collaborative potential of the multi-agent system: their consensus-based decision-making overlooks the benefits of coordinated exploration.
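
As a hedged illustration of how $\lambda_2$ depends on the graph in the setup above, suppose the $N$ agents are arranged on a ring, each agent $j$ is linked to agents $j\pm1$ and $j\pm2$ (mod $N$), and information is mixed with a symmetric circulant weight matrix $W$ with self-weight $w_0$ and neighbor weights $w_1,w_2$ satisfying $w_0+2w_1+2w_2=1$. The neighbor pattern, the weights, and the symbol $W$ are assumptions made for this sketch and are not specified in the text. Under these assumptions, and for $N\ge 5$, the eigenvalues of $W$ are
\[
\lambda^{(k)} = w_0 + 2w_1\cos\frac{2\pi k}{N} + 2w_2\cos\frac{4\pi k}{N},
\qquad k=0,1,\dots,N-1,
\]
with $\lambda^{(0)}=1$, and the second-largest eigenvalue is the largest value of $\lambda^{(k)}$ over $k\ge 1$. Which weights yield the reported $\lambda_2=0.5713$ is not stated here.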

\paragraph{Observations.} 
Figure~\ref{fig1} reports regrets for the proposed algorithm and the baselines. Figures~\ref{original} and \ref{original1} show the individual and group regrets of the three algorithms. \texttt{DRRB-bandit} performs poorly during the initial phase, since it samples all arms uniformly. However, after sufficient sampling, all agents successfully eliminate the suboptimal arms, and the regret stabilizes, remaining almost unchanged for the rest of the horizon.
In Figure~\ref{Vary arm number}, the regret increases with the number of arms $K$, in line with the individual regret bound $O(\sum_{i:\Delta_i>0}N^{-1}\Delta_i^{-1}\log T)$, which also increases with $K$. The explanation is intuitive: as the number of arms increases, learning each arm's reward becomes more difficult, leading to higher regret.
Finally, when varying the number of agents, we observe that the regret of \texttt{DRRB-bandit} decreases, in line with its $O(\sum_{i:\Delta_i>0}N^{-1}\Delta_i^{-1}\log T)$ individual regret bound, which decreases with the number of agents $N$. In contrast, \texttt{Gossip-UCB} shows increasing regret, consistent with its regret bound $O(\sum_{i:\Delta_i>0}N\Delta_i^{-1}\log T)$. For \texttt{Dis-UCB}, since the number of neighbors of each agent remains fixed when the number of agents changes, we also observe increasing regret, as shown in Figure~\ref{Vary agent number}. We also provide additional simulations of the homogeneous setting in Appendix~\ref{appendix: simulation}.
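
To make the dependence on $N$ explicit, write $C_T=\sum_{i:\Delta_i>0}\Delta_i^{-1}\log T$ (a shorthand introduced here) and ignore the constants hidden in the $O(\cdot)$ notation, so that the following compares upper bounds only, not measured regrets. The individual regret bounds of \texttt{Gossip-UCB} and \texttt{DRRB-bandit} quoted above then differ by the factor
\[
\frac{N\,C_T}{N^{-1}C_T}=N^{2},
\]
which equals $64$ for the default $N=8$. If, in addition, the group regret is taken to be the sum of the $N$ individual regrets (an assumption, as the definition is not restated in this section), the corresponding bound for \texttt{DRRB-bandit} becomes $N\cdot N^{-1}C_T=C_T$, which does not grow with $N$.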

\begin{figure}[htb]
\centering
\begin{subfigure}{0.9\linewidth}
     \centerline{\includegraphics[width=\columnwidth]{Graph/title.jpg}}
\end{subfigure}
\begin{subfigure}{0.48\linewidth}
    \centerline{\includegraphics[width=\columnwidth]{Graph/1.png}}
    \caption{Individual regrets}
    \label{original}
\end{subfigure}
\begin{subfigure}{0.48\linewidth}
    \centerline{\includegraphics[width=\columnwidth]{Graph/2.png}}
    \caption{Group regrets}
    \label{original1}
\end{subfigure}\\
\begin{subfigure}{0.48\linewidth}
    \centerline{\includegraphics[width=\columnwidth]{Graph/3.png}}
    \caption{Regrets with different numbers of arms}
    \label{Vary arm number}
\end{subfigure}
\begin{subfigure}{0.48\linewidth}
    \centerline{\includegraphics[width=\columnwidth]{Graph/4.png}}
    \caption{Regrets with different numbers of agents}
    \label{Vary agent number}
\end{subfigure}
\caption{Performance comparison with different arms and agents.\label{fig1}}
\end{figure}