\section{Simulation Results}\label{sec:simulation}

In this section, we evaluate the performance of our proposed \textbf{\algname}~algorithm and compare it with existing methods. We also analyze how network connectivity and topology influence the performance of \textbf{\algname}.

\textbf{Datasets and models.} Unless otherwise specified, we use \( N = 100 \) clients for all experiments on handwritten character recognition (MNIST and EMNIST datasets \citep{cohen2017emnist}) and \( N = 25 \) clients for all experiments on image classification (CIFAR-10 and CIFAR-100 datasets \citep{krizhevsky2009learning}). Each client trains a convolutional neural network (CNN), following the settings of \citet{ruan2022fedsoft}, on data from a mixture of \( S = 2 \) distributions, \(\mathcal{D}_A\) and \(\mathcal{D}_B\). \revise{Additional results using MobileNet-V2 are also included in Appendix~\ref{sec:mobilenet}}. Each client randomly draws 10\% to 90\% of its data from \(\mathcal{D}_A\) and the remainder from \(\mathcal{D}_B\), with heterogeneity introduced through unbalanced classes \citep{marfoq2021federated}, image rotation \citep{ruan2022fedsoft}, or both. This creates a portion of clients with significantly unbalanced data and guarantees that each client has a unique distribution. We follow \citet{ruan2022fedsoft} and \citet{marfoq2021federated} for other parameter settings; details are given in Appendix~\ref{sec:exsim}. Test accuracy is evaluated on each client's local test dataset, which is unseen during training.
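
As a compact restatement of the data mixture above, the local data of client $i$ can be viewed as drawn from the distribution below. The symbols $p_i$ and $\mathcal{D}_i$ are shorthand introduced here for illustration only:
\[
  \mathcal{D}_i = p_i\,\mathcal{D}_A + (1 - p_i)\,\mathcal{D}_B, \qquad p_i \in [0.1,\, 0.9].
\]
For instance, a client with $p_i = 0.3$ holds 30\% of its samples from $\mathcal{D}_A$ and 70\% from $\mathcal{D}_B$.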

\textbf{Client communications.} Unless otherwise specified, the client graph is a connected Erdős–Rényi (ER) random graph \citep{erdos1960evolution} with an average degree between 5 and 12; more details are in Appendix~\ref{sec:exsim}.
To avoid the label switching problem \citep{stephens2000dealing}, we compute the cosine similarities of the model parameters received from other clients to ensure that clients reach consensus on which model corresponds to which cluster.
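
For reference, the cosine similarity between two flattened parameter vectors is the standard quantity below. The symbols $\theta_a$ and $\theta_b$ are placeholders introduced here, not notation from the algorithm description:
\[
  \cos(\theta_a, \theta_b) = \frac{\langle \theta_a, \theta_b \rangle}{\|\theta_a\|_2 \, \|\theta_b\|_2} \in [-1,\, 1].
\]
A value close to $1$ indicates nearly parallel parameter vectors, which is consistent with the two models representing the same cluster.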

\textbf{Baselines.} We compare \textbf{\algname}~with: (i) centralized and decentralized \textbf{FedAvg} \citep{mcmahan2017communication}; (ii) centralized and decentralized \textbf{FedEM} \citep{marfoq2021federated}, a prior soft clustering method; (iii) centralized and decentralized versions of \textbf{FedSoft} \citep{ruan2022fedsoft}, which also uses soft clustering; (iv) centralized and decentralized \textbf{IFCA} \citep{ghosh2020efficient}, which uses hard clustering; (v) centralized and decentralized \textbf{pFedMe} \citep{t2020personalized}, another state-of-the-art FL personalization approach without clustering; and (vi) \textbf{local} training on each client's local dataset only.

\textbf{Additional results} are deferred to the appendix due to the page limit. These include:

\begin{itemize}
    \item We conduct ablation studies to analyze the impact of different factors on the performance of \textbf{FedSPD}, specifically evaluating:
    \begin{itemize}
        \item The influence of the number of local training epochs in Section~\ref{sec:fl_localep};
        \item The contribution of the final training phase in Section~\ref{sec:fl_final};
        \item The impact of the number of clusters in Section~\ref{sec:fl_cluster}; and
        \item More details of the effect of network connectivity in Section~\ref{sec:fl_edcon}. We also show how the dynamic network topology influences the performance of \textbf{\algname}.
    \end{itemize}
    \item In Section~\ref{sec:IDA}, we evaluate the performance of \textbf{FedSPD} under a more challenging setting where the total amount of data is imbalanced across clients, in addition to the data heterogeneity across clusters.
    \item To explore the potential for enhancing privacy guarantees in DFL, we incorporate Differential Privacy into \textbf{FedSPD} and present the results in Section~\ref{sec:dp}.
    \item For most experiments on the CIFAR-10/CIFAR-100 datasets, we adopt the same CNN model used by~\cite{ruan2022fedsoft} to ensure a fair comparison. To further assess the scalability of \textbf{FedSPD}, we evaluate its performance using a more complex architecture, MobileNet-V2, in Section~\ref{sec:mobilenet}.
\end{itemize}

\subsection{Comparison with Baselines}
We first compare our method with the centralized and decentralized baselines. Our results on EMNIST, CIFAR-10, and CIFAR-100 are shown in Tables~\ref{tab:test-cfl} and~\ref{tab:test-dfl}. \textbf{\algname}~achieves the highest accuracy among DFL methods on CIFAR-10 and CIFAR-100 and is within one percentage point of the best DFL method (\textbf{IFCA}, 83.88\% versus 83.07\%) on EMNIST, while approaching the accuracy of CFL. The centralized methods still generally outperform decentralized methods, as expected from prior literature \citep{sun2023mode}. However, decentralized methods offer advantages such as lower communication traffic and increased robustness, since they do not rely on a centralized server that is a single point of failure.

Figure~\ref{fig:CMTA} shows the training accuracy versus the number of epochs on the CIFAR-10 dataset. \textbf{\algname}~\textit{converges faster} than all other DFL algorithms in terms of training accuracy. \new{This shows that each of the clusters in \textbf{\algname} does converge as desired. Note that compared to \textbf{FedEM}, another soft clustering method, \textbf{\algname} requires half the communication cost, since \textbf{FedEM} clients exchange the models of all $S = 2$ clusters.}

\begin{table*}[htbp]
          \centering
\begin{tabular}
{|c|c|c|c|c|c|c|}
\hline
\multicolumn{1}{|c|}{} & \multicolumn{1}{|c|}{DFL} & \multicolumn{5}{|c|}{CFL} \\ \hline
%\hline  
Dataset & \textbf{FedSPD} & FedEM & IFCA & FedAvg & FedSoft & pFedMe \\
\hline EMNIST & 83.07 & 88.83 & 89.42 & 88.81 & 84.97 & 90.95 \\
\hline CIFAR-10 & 68.72 & 79.64 & 79.52 & 79.36 & 76.62 & 79.43 \\
\hline CIFAR-100 & 40.38 & 44.25 & 43.91 & 43.11 & 39.76 & 8.74\footnotemark \\
\hline
\end{tabular}
\caption{\sl \textbf{\algname}~has comparable test accuracy to CFL algorithms. Accuracies are reported in percent (\%).}
    \label{tab:test-cfl}
\end{table*}

\begin{table*}[htbp]
          \centering
\begin{tabular}
{|c|c|c|c|c|c|c|c|}
\hline
\multicolumn{1}{|c|}{} & \multicolumn{6}{|c|}{DFL} & \multicolumn{1}{|c|}{}\\ \hline
%\hline  
Dataset & \textbf{FedSPD} & FedEM & IFCA & FedAvg & FedSoft & pFedMe & Local\\
\hline EMNIST & 83.07 & 80.47 & \textbf{83.88} & 78.61 & 74.30 & 81.16 & 56.91\\
\hline CIFAR-10 & \textbf{68.72} & 50.45 & 52.39 & 49.21 & 42.38 & 49.48 & 41.82\\
\hline CIFAR-100 & \textbf{40.38} & 18.59 & 17.18 & 17.20 & 13.17 & 18.27 & 13.31\\
\hline
\end{tabular}
\caption{\sl \textbf{\algname}~achieves higher test accuracy than other DFL algorithms in most cases. Accuracies are reported in percent (\%).}
    \label{tab:test-dfl}
\end{table*}


To assess \textbf{fairness across clients}, we show a box plot of the final test accuracy across different clients on EMNIST in Figure~\ref{fig:BP}. \textbf{\algname}~has much \textit{lower variance in accuracy} across clients than all baselines except \textbf{pFedMe}, confirming that its improvement in average accuracy does not come from high accuracy on only a few clients. % Accuracies across clients on MNIST and CIFAR-10 dataset for \textit{\algname} are shown in Figure \ref{fig:tt}. 


% \begin{figure}[htb]
% \begin{subfigure}{0.47\textwidth}
%     \centering
%     \includegraphics[width=1.0\linewidth]{Styles/Fig/ACM3.png}
%     \caption{\sl MNIST dataset with 50 clients.}
%     \label{fig:tt1}
% \end{subfigure}
% \hfill
% \begin{subfigure}{0.47\textwidth}
%     \centering
%     \includegraphics[width=1.0\linewidth]{Styles/Fig/ACC3.png}
%     \caption{\sl CIFAR-10 dataset with 25 clients.}
%     \label{fig:tt2}
% \end{subfigure}
% \caption{\sl \algname~consistently shows similar test accuracies across clients (sorted from low to high).}
% \label{fig:tt}
% \end{figure}

\begin{figure}[t]
    \centering
    \includegraphics[width=0.90\linewidth]{Styles/Fig/TAcc3.png}
\caption{\sl Training accuracy of different DFL methods versus number of epochs on CIFAR-10 ($N=25$). \textbf{\algname}~converges faster in terms of training accuracy than all other DFL methods.
}
\label{fig:CMTA}
\end{figure}

\begin{figure}[t]
    \centering
    \includegraphics[width=0.90\linewidth]{Styles/Fig/boxplot_bold.png}
\caption{\sl Box plot of test accuracy across clients on the EMNIST dataset. \textbf{\algname}~has much lower variance in test accuracy across clients than all baselines except \textbf{pFedMe}.}
\label{fig:BP}
\end{figure}

\subsection{Effects of Network Connectivity}
\label{sec:sim_nc}
In this subsection, we investigate how performance varies with the connectivity of the client network. Figure~\ref{fig:CFCN} shows the test accuracy of different DFL methods under varying client connectivity on the CIFAR-100 dataset using ER random graphs, averaged over three experimental runs. \textbf{\algname}~consistently achieves the highest test accuracy, though the performance of the other methods begins to increase as the graph becomes more connected (i.e., as the probability of link formation increases).
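
As a rough guide to relating the link probability to the average degree, and assuming the standard ER model in which each of the possible links forms independently with probability $p$ (the symbols $p$ and $\bar{k}$ are shorthand introduced here), the expected average degree is
\[
  \mathrm{E}[\bar{k}] = (N - 1)\, p .
\]
For instance, with $N = 15$ clients, a link probability of $p = 0.5$ corresponds to an expected average degree of $14 \times 0.5 = 7$. If the graph is additionally required to be connected, this value shifts slightly upward for small $p$.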
% \algname~outperforms other DFL method, especially in \textit{low-connectivity} network.

\begin{figure}[t]
    \centering
    \includegraphics[width=0.90\linewidth]{Styles/Fig/CNT2S.png}
\caption{\sl Test accuracy of different methods under different connectivity levels of an ER random graph on the CIFAR-100 dataset ($N=15$). \textbf{\algname}~shows consistently high test accuracies compared to other DFL methods.}
\label{fig:CFCN}
\end{figure}

Tables~\ref{tab:p2-MN} and~\ref{tab:p2-EM} show the test accuracy of \textbf{\algname}~for different network types and connectivity levels. We use three network topologies: the ER random graph; the Barabási–Albert (BA) model \citep{albert2002statistical}, whose preferential attachment yields a power-law degree distribution; and the random geometric graph (RGG) \citep{penrose2003random}, which is often used to model wireless communication and IoT scenarios and exhibits high clustering. We observe that the final test accuracy does not vary significantly across network topologies and levels of connectivity on MNIST. On EMNIST, the test accuracy increases slightly as the average degree increases. The test accuracy is more stable on RGG across connectivity levels, which we conjecture is due to RGG's highly clustered nature. Thus, as long as the network is connected, \textbf{\algname}~performs well in both high- and low-connectivity scenarios and across various types of networks. As we expect from Theorem~\ref{thm:4}, \textbf{\algname}~converges regardless of the network topology.

\subsection{Communication Overhead}

\textbf{\algname}~transmits 50\% less data than \textbf{FedEM} (in the case $S=2$), since each client transmits only a single model. As the number of clusters $S$ increases, our communication volume advantage grows. Compared to the decentralized versions of \textbf{FedAvg} and \textbf{FedSoft}, \textbf{\algname}~requires each client to send the same volume of data (equivalent to one model's parameters) in each round. \revise{However, since \textbf{\algname}~only requires clients to send their local models to the neighbors that are training models for the same cluster, each client communicates with fewer clients in \textbf{\algname}~than in algorithms like \textbf{FedAvg} and \textbf{FedSoft}, which require each client to send its local model to \textit{all} of its neighbors. \textbf{\algname}, \textbf{FedAvg}, and \textbf{FedSoft} thus have comparable communication overhead if clients use multicast communication, but if they use point-to-point communication, \textbf{\algname}~requires less communication than \textbf{FedAvg} and \textbf{FedSoft} with full participation, because each client has fewer recipients.}
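
The accounting behind these comparisons can be sketched as follows, under simplifying assumptions stated here rather than taken from the algorithm description: every model has $d$ parameters, client $i$ has $|\mathcal{N}_i|$ neighbors, and $|\mathcal{N}_i^{s}| \le |\mathcal{N}_i|$ of them train the same cluster $s$ in the current round. The symbols $d$, $\mathcal{N}_i$, and $\mathcal{N}_i^{s}$ are illustrative. Per message, \textbf{FedEM} sends $S\,d$ parameters while \textbf{\algname}~sends $d$, so the volume ratio and the corresponding saving are
\[
  \frac{d}{S\,d} = \frac{1}{S}, \qquad 1 - \frac{1}{S} = \frac{1}{2} \quad \mathrm{for} \quad S = 2 .
\]
With point-to-point communication, the per-round volume sent by client $i$ is $d\,|\mathcal{N}_i^{s}|$ for \textbf{\algname}~and $d\,|\mathcal{N}_i|$ for \textbf{FedAvg} and \textbf{FedSoft}, so that
\[
  d\,|\mathcal{N}_i^{s}| \;\le\; d\,|\mathcal{N}_i| ,
\]
with equality only when all neighbors of client $i$ train the same cluster.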

\footnotetext{The centralized pFedMe on CIFAR-100 does not converge in the various settings of hyperparameters we tried.}

\subsection{Discussion}

As shown in Tables~\ref{tab:test-cfl} and~\ref{tab:test-dfl}, local training is at or near the bottom on every dataset (only decentralized \textbf{FedSoft} on CIFAR-100 is marginally lower, at 13.17\% versus 13.31\%), confirming that the other methods generally benefit from exchanging information between clients to learn a better model. Among the DFL algorithms other than \textbf{FedSoft}, \textbf{FedAvg}, the only one without personalization, typically performs the worst, indicating that personalization is beneficial under non-IID data distributions, as we would intuitively expect. \textbf{FedSoft} is the exception: it falls below \textbf{FedAvg} on all three datasets, and on CIFAR-10 and CIFAR-100 its accuracy is close to that of local training. We conjecture that this is due to the way \textbf{FedSoft} aggregates models, which makes it difficult to learn the correct cluster centers in a low-connectivity network and leads to suboptimal performance. \textbf{\algname}~uses a new model update method designed to avoid this issue. A more detailed comparison of \textbf{\algname} and \textbf{FedSoft} can be found in Section~\ref{sec:comp}.

\begin{table}[h!]
  \centering
\begin{tabular}
{|p{1.8cm}|p{0.75cm}|p{0.75cm}|p{0.75cm}|p{0.75cm}|p{0.75cm}|}
\hline  Avg. Degree & 6 & 8 & 10 & 12 & 14 \\
\hline ER & 92.86 & 92.93 & 93.37& 93.31& 93.26 \\
\hline BA & 93.06 & 92.58 & 92.56& 92.87& 93.17 \\
\hline RGG & 92.86 & 92.61 & 92.84& 93.49& 92.97 \\
\hline
\end{tabular}
\caption{\sl \textbf{\algname}~shows consistently high test accuracies on MNIST data for $N=50$ clients across different client network topologies.}
  \label{tab:p2-MN}
\end{table}

\begin{table}[h!]
  \centering
\begin{tabular}
{|p{1.8cm}|p{1.05cm}|p{1.05cm}|p{1.05cm}|p{1.05cm}|}
\hline  Avg. Degree & 8 & 12 & 16 & 20  \\
\hline ER & 79.79 & 82.26 & 84.28 & 84.49 \\
\hline BA & 79.45 & 82.13 & 84.58 & 84.73 \\
\hline RGG & 82.26 & 83.49 & 84.06 & 84.08\\
\hline
\end{tabular}
\caption{\sl \textbf{\algname}~shows consistently high test accuracies on EMNIST data for $N=50$ clients across different client network topologies.}
  \label{tab:p2-EM}
\end{table}

% \begin{figure*}[h!]
% \begin{subfigure}{0.32\textwidth}
%     \centering
%     \includegraphics[width=1.0\linewidth]{Styles/Fig/ER.jpg}
%     \caption{\sl Training Accuracy of ER Graph.}
%     \label{fig:ER}
% \end{subfigure}
% \hfill
% \begin{subfigure}{0.32\textwidth}
%     \centering
%     \includegraphics[width=1.0\linewidth]{Styles/Fig/BA.png}
%     \caption{\sl Training Accuracy of BA Model.}
%     \label{fig:BA}
% \end{subfigure}
% \hfill
% \begin{subfigure}{0.32\textwidth}
%     \centering
%     \includegraphics[width=1.0\linewidth]{Styles/Fig/RGG.png}
%     \caption{\sl Training Accuracy of RGG.}
%     \label{fig:RGG}
% \end{subfigure}
% \caption{\sl \algname~converges slightly faster on networks of higher average degree, with noisier convergence on highly clustered RGG graphs, on MNIST Data.}
% \label{fig:topology}
% \end{figure*}
