\section{Detailed comparison between \textbf{\algname} and \textbf{FedSoft}}\label{sec:comp}

The training processes of \textbf{\algname} and \textbf{FedSoft} differ fundamentally. In \textbf{FedSoft}, local training utilizes a proximal objective function with regularization terms to account for the distance to each cluster center. Conversely, \textbf{\algname} trains using data associated with the selected cluster, where model parameters are updated based on an objective function that does not include the regularization term. This modification is critical in decentralized federated learning, where ensuring both convergence to an optimal solution and consensus on cluster centers is essential. Unlike centralized systems, which aggregate cluster centers at a single server, decentralized systems lack this central coordination, making consensus challenging.
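As a sketch of this contrast, in notation assumed here purely for illustration (client $k$ with local loss $f_k$ and local model $w_k$, $S$ clusters with current center estimates $c_1,\dots,c_S$, nonnegative weights $u_{ks}$, and a regularization strength $\lambda > 0$), a proximal local objective of the kind described for \textbf{FedSoft} can be written as
\[
  \min_{w_k}\; f_k(w_k) + \frac{\lambda}{2} \sum_{s=1}^{S} u_{ks}\, \| w_k - c_s \|^2 ,
\]
whereas a regularizer-free update for a selected cluster $s$, as described for \textbf{\algname}, would minimize only the empirical loss on the data $D_{k,s}$ that client $k$ associates with that cluster:
\[
  \min_{c}\; \frac{1}{|D_{k,s}|} \sum_{\xi \in D_{k,s}} \ell(c;\xi) .
\]
Here $\ell$ denotes a per-sample loss and $D_{k,s}$ is assumed to be nonempty. The exact weighting and normalization used by either method may differ from this sketch.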
\textbf{\algname} was inspired by \textbf{FedSoft}. We first attempted to bound the consensus distance of the cluster centers under \textbf{FedSoft}'s update rules, but these attempts suggest that consensus may not be reached when \textbf{FedSoft} is run in decentralized settings. The experimental results in Section 6 demonstrate \textbf{FedSoft}'s limitations in decentralized scenarios. Specifically, \textbf{FedSoft}'s cluster centers fail to reach their optimal values because of its update rules:

\begin{itemize}
    \item Uniform Data Utilization: \textbf{FedSoft} updates each cluster center using all available data.
    \item Probabilistic Contribution: \textbf{FedSoft} uses probabilities proportional to the estimated data distribution among clusters to guide contributions to the selected cluster.
\end{itemize}

These update rules bias the gradients in local updates toward a mixture of the optimal centers of all clusters, rather than toward the correct optimal center of the selected cluster. As datasets grow more complex (e.g., CIFAR-10 or CIFAR-100) and the optimal cluster centers diverge significantly, this bias becomes more pronounced and degrades performance. This is evident in Section 6, where \textbf{FedSoft}'s performance deteriorates as the datasets shift from EMNIST to CIFAR-10 and CIFAR-100 in both centralized and decentralized scenarios.
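
The effect can be illustrated with a simplified example; the quadratic losses and the symbols below are assumptions made only for this illustration and are not part of either algorithm. Suppose cluster $s$ has loss $f_s(c) = \frac{1}{2}\| c - c_s^* \|^2$ with optimum $c_s^*$, and the center of cluster $s$ is updated on data drawn from all clusters with mixture proportions $p_{s'} \ge 0$, $\sum_{s'=1}^{S} p_{s'} = 1$. The update then follows the gradient of the mixed loss,
\[
  \nabla \Big( \sum_{s'=1}^{S} p_{s'} f_{s'}(c) \Big) = \sum_{s'=1}^{S} p_{s'} \, ( c - c_{s'}^* ) ,
\]
whose stationary point is the mixture $\tilde{c} = \sum_{s'=1}^{S} p_{s'} c_{s'}^*$. Its distance to the correct optimum is
\[
  \| \tilde{c} - c_s^* \| = \Big\| \sum_{s' \neq s} p_{s'} \, ( c_{s'}^* - c_s^* ) \Big\| ,
\]
which vanishes only when the other optima coincide with $c_s^*$ or receive zero weight, and grows as the optimal cluster centers move apart.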

In decentralized settings, the sparse updates exacerbate the difficulty for clients to estimate optimal cluster centers accurately. This is especially problematic for complex datasets like CIFAR-10 and CIFAR-100, where \textbf{FedSoft} performs significantly worse. This limitation highlights the need for a different approach, such as the one introduced by \textbf{\algname}.

In summary, during each round of the first phase of \textbf{\algname}, clients train separate models using the data associated with their selected clusters. This approach ensures consensus and optimality of the cluster centers. Unlike centralized scenarios, where clients share a unified cluster center, each client in a decentralized system maintains its own estimates of the cluster centers. This necessitates a rigorous proof of consensus across clients, as given in our theoretical analysis. Consequently, the proof techniques for \textbf{\algname} and \textbf{FedSoft} differ fundamentally.
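
One common way to quantify consensus, stated here in assumed notation that may differ from the quantity used in our formal analysis, is the consensus distance. With $N$ clients, let $c_{i,s}^{t}$ denote client $i$'s estimate of the center of cluster $s$ at round $t$, and let $\bar{c}_s^{\,t} = \frac{1}{N} \sum_{i=1}^{N} c_{i,s}^{t}$ be the network-wide average. Then
\[
  \Xi_s^{t} = \frac{1}{N} \sum_{i=1}^{N} \big\| c_{i,s}^{t} - \bar{c}_s^{\,t} \big\|^2 .
\]
In a centralized system all clients receive the same center from the server, so $\Xi_s^{t} = 0$ after every aggregation. In a decentralized system $\Xi_s^{t}$ is generally positive, and consensus requires showing that it becomes small as training proceeds.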

Once consensus on cluster centers is achieved, the second phase of \textbf{\algname} aggregates models by computing a weighted average of the cluster centers, aligning with \textbf{FedSoft}'s objective. To address the suboptimality of non-convex models, \textbf{\algname} incorporates an additional final phase of local training. This phase enables further exploration of the model parameters, mitigating suboptimality and enhancing overall performance.
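
As a sketch of the aggregation step, in assumed notation: let $\bar{c}_s$ denote the agreed center of cluster $s$ and let $u_{is} \ge 0$ with $\sum_{s=1}^{S} u_{is} = 1$ be client $i$'s weights (for instance, the estimated fraction of its data belonging to cluster $s$). The aggregated model of client $i$ is then
\[
  w_i = \sum_{s=1}^{S} u_{is} \, \bar{c}_s ,
\]
which serves as the starting point for the final phase of local training. The precise choice of weights is given by the algorithm description and may differ from this illustration.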

\textbf{Differences in Theoretical Analysis.}
\begin{itemize}
    \item Relaxation of Assumption 2 in \textbf{FedSoft}. \textbf{FedSoft} relies on Assumption 2 of its paper, which states $\beta$-similarity among the subproblems of the different clusters. In other words, the optima of the different clusters must be close enough to guarantee a bounded distance between each learned cluster center and the corresponding optimal cluster center. The assumption is needed because \textbf{FedSoft} uses all data to update each cluster center: when data from other clusters is used to update the selected cluster, the assumption keeps the gradient update from moving too far from the optimum of the selected cluster. In contrast, the update rule of \textbf{\algname} uses only the data associated with the selected cluster, which guarantees optimality of the cluster centers and yields a tighter convergence bound without this additional assumption.
    \item Consensus in the decentralized setting. The proof of \textbf{FedSoft} is developed for centralized FL (CFL), where consensus holds automatically because of centralized aggregation; its effectiveness in decentralized settings is not proved. In contrast, we prove consensus of the cluster centers in \textbf{\algname} under decentralized FL (DFL), which is one of the main challenges of our theoretical analysis. Proving consensus and convergence is much harder in DFL than in CFL because each client keeps its own estimates of the cluster centers. Our attempts to bound the consensus distance of \textbf{FedSoft} in DFL failed because the way \textbf{FedSoft} updates its cluster centers may not yield the same optimum for all clients; the optimum may differ depending on each client's neighbors.
\end{itemize}

