\section{Introduction}
\label{Intro}

Federated Learning (FL) is a popular approach for distributed clients to collaboratively learn from their local data. 
% One widely used FL algorithm, FedAvg \citep{mcmahan2017communication}, facilitates client devices in collectively training a \textbf{shared} machine learning model. 
The most popular FL algorithm, \textbf{FedAvg} \citep{mcmahan2017communication}, and most of its variants operate within a centralized federated learning (CFL) framework, where a central server coordinates the training process.\footnote{Note that clients in the CFL setting still train their models in a distributed manner; the term ``centralized'' simply refers to the presence of a central server managing the clients' interactions.} In each CFL training round, each client independently trains a model on its local data and sends the model parameters to a central server for aggregation; the aggregated model is then broadcast back to the clients to begin a new training round. However, communication delays and bottlenecks often arise when a CFL system includes numerous mobile or Internet-of-Things (IoT) clients, hampering CFL's efficiency. Furthermore, this centralized structure is vulnerable to attacks and failures because the central server is a single point of failure \citep{lalitha2018fully}.
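
For concreteness, the aggregation step described above can be sketched as the data-weighted average used by FedAvg. The symbols below ($N$ clients, local model $w_i^{t+1}$ after local training in round $t$, and local sample count $n_i$) are illustrative notation assumed only for this sketch, and we assume for simplicity that all $N$ clients participate in the round:
\[
  w^{t+1} \;=\; \sum_{i=1}^{N} \frac{n_i}{n}\, w_i^{t+1},
  \qquad n = \sum_{i=1}^{N} n_i .
\]
Every client must exchange its model with the server to compute this average, which is the source of the communication bottleneck and single point of failure noted above.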

\textbf{Decentralized Federated Learning} (DFL) addresses these limitations by adopting a fully decentralized architecture where clients share their locally trained model parameters directly with neighboring clients, eliminating the need for a central server \citep{lalitha2018fully}. This approach can also % more flexible communication networks among clients, leading to 
reduce communication and computational costs \citep{beltran2023decentralized}. % while mitigating vulnerabilities associated with a central server. %\osman{What is a heterogeneous communication network?} 
However, most existing DFL methods focus on learning a single global model for all clients, aiming for consensus across clients. Such a global model may underperform on clients with non-IID (i.e., not independent and identically distributed) local data, as is commonly the case in federated learning. To address this challenge, we design an efficient \textbf{personalized, decentralized federated learning algorithm} that personalizes models to each client's data distribution without relying on a central server and preserves DFL's communication benefits by limiting the required communication between clients. \revise{We particularly focus on settings where clients are IoT devices using device-to-device communication protocols. Such settings often feature limited network connectivity, communication resources, and computation resources, e.g., sensor-based environmental monitoring or vehicles learning personalized models of human driver preferences~\citep{nakanoya2021personalized}.}

Personalization of a shared global model has been shown to improve performance in CFL settings \citep{ruan2022fedsoft, marfoq2021federated}. However, extending such personalization methods to DFL poses significant \textbf{technical challenges}. DFL algorithms typically strive for consensus by having each client share its local model with its neighbors, which are only a subset of all clients. Ensuring that all clients can benefit from each other's updates despite this limited communication is a key challenge \citep{beltran2023decentralized}. In contrast, learning personalized models requires intentionally maintaining differences between clients' models, particularly when their data are non-IID. This makes it difficult to tell whether disparities between models are due to insufficient communication or to genuine differences in local data distributions. We overcome this challenge by \textit{quantifying similarities between client data} using a clustering-based method, which allows us to train distinct models for different data clusters and then personalize them to each client's unique data mixture.

Prior works that seek to personalize models in DFL settings, including cluster-based methods, are typically straightforward extensions of personalization methods designed for CFL settings. They do not take into account the distinct communication patterns in DFL and thus perform poorly when the client network has poor connectivity. For example, a na\"ive clustering method assigns each client to a single cluster based on its data distribution \citep{ghosh2020efficient}. However, such ``hard'' clustering assumes that all clients within the same cluster have identical distributions, which is rarely the case. Instead, we adopt a \textbf{soft clustering} approach, as explored in CFL settings \citep{ruan2022fedsoft, marfoq2021federated}, where each client's data is modeled as an unknown mixture of distributions, and a model is trained for each cluster in this mixture. Existing DFL soft clustering approaches require clients to train models for all clusters in every round \citep{marfoq2021federated}, imposing \textbf{significant training and communication overhead} that scales linearly with the number of clusters. This is particularly problematic in DFL scenarios, where clients often have limited communication and computation capacity \citep{nguyen2021federated}. Therefore, we introduce a training algorithm that (i) learns each client's mixture coefficients, (ii) ensures consensus on models for each cluster, and (iii) unlike prior work, avoids communication resource requirements that scale with the number of clusters. Our \textbf{contributions} are as follows:
\begin{itemize}
    \item We propose \textbf{\algname}, a novel FL algorithm in which clients use soft clustering to train personalized models in a decentralized manner. \textbf{\algname}~allows clients to reach a consensus on cluster-specific models and adapt their cluster mixture estimates over time, while requiring each client to train only \textbf{one} cluster model per training round, significantly reducing communication.
    \item We \textbf{prove the convergence of \algname} in Theorem~\ref{thm:4}. Because each client trains only one cluster model per round, this proof adopts a different approach from prior work on soft clustering in DFL, which typically requires clients to train models for every cluster in each round \citep{marfoq2021federated}.
    \item We demonstrate through experiments on real-world datasets that \textbf{\algname~outperforms existing DFL algorithms} (both personalized and non-personalized). In some cases, \textbf{\algname} even approaches the accuracy of centralized algorithms. \revise{Furthermore, we show that \textbf{\algname} is \textbf{particularly effective in low-connectivity networks with computationally constrained clients}}.
\end{itemize}

Following a review of related work in Section~\ref{sec:related}, we present our DFL model in Section~\ref{sec:formulation} and introduce the \textbf{\algname}~algorithm in Section~\ref{sec:algorithms}. We then provide a convergence proof in Section~\ref{sec:math} and demonstrate the algorithm's superior performance in Section~\ref{sec:simulation}, before concluding in Section~\ref{sec:conclusion}.


\section{Related Work}\label{sec:related}

% \carlee{This should be a lot shorter. We can also take out the subsections.}

%The centralized FL resulted in higher latency due to bottlenecks, more vulnerable to system failures. DFL emerged as a solution by promoting decentralized model aggregation and reducing dependence on centralized servers. DFL can effectively reduce communication traffic with the centralized server, especially when communication resources are limited. 
\textbf{Decentralized Federated Learning} has its roots in decentralized optimization \citep{nedic2009distributed, wei2012distributed, zhang2021newton} and in particular decentralized Stochastic Gradient Descent (SGD) \citep{lian2017can}. Several methods have been explored for decentralized optimization \citep{nedic2009distributed, wu2017decentralized, lu2020computation}. The convergence of decentralized SGD was first analyzed by \citet{yuan2016convergence}, and by \citet{sirb2018decentralized} under delayed information, and \citet{lian2017can} highlighted decentralized SGD's advantages over centralized methods. This literature establishes conditions on client connectivity under which all local models converge to a consensus model \citep{lian2017can}. The effects of client communication topologies in DFL~\citep{lalitha2018fully,warnat2021swarm} have also been studied, and gradient tracking techniques based on push-sum algorithms have been proposed to relax the assumptions on client connectivity needed to show consensus \citep{nedic2014distributed, nedic2016stochastic, assran2019stochastic}.
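
As an illustrative sketch of the consensus mechanism discussed above, decentralized SGD in the style of \citet{lian2017can} has each client $i$ average its neighbors' models and then take a local stochastic gradient step. The notation here (mixing matrix $W$, step size $\eta$, local loss $F_i$, and local sample $\xi_i^t$) is assumed for illustration and need not match the notation of later sections:
\[
  x_i^{t+1} \;=\; \sum_{j} W_{ij}\, x_j^{t} \;-\; \eta\, \nabla F_i\!\left(x_i^{t};\, \xi_i^{t}\right),
\]
where $W_{ij}$ is nonzero only if $j = i$ or clients $i$ and $j$ are neighbors. Consensus results of this kind typically assume, for example, that $W$ is doubly stochastic and that the communication graph is connected.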


%The traditional FL approach of learning a single model for all clients may not converge when clients' data is highly non-IID~\citep{mcmahan2017communication}, and even if it does converge, may yield poor performance at some clients that can even discourage affected clients from participating in the FL process~\citep{huang2020efficiency}. 
\textbf{Personalization} in CFL is generally motivated by highly non-IID client data~\citep{mcmahan2017communication, collins2021exploiting}, which can impede convergence and cause the global model to perform poorly at some clients, possibly discouraging them from participating in the FL process~\citep{huang2020efficiency}.
Common techniques include local finetuning \citep{sim2019personalization}, model interpolation \citep{mansour2020three}, meta-learning \citep{fallah2020personalized}, adding regularization terms \citep{t2020personalized}, and multi-task learning \citep{smith2017federated, yousefi2019multi, li2021ditto}. Clustered FL in particular includes hard clustering, which partitions clients into clusters based on their data's similarity \citep{ghosh2020efficient} and its variations \citep{xie2021multi, briggs2020federated, duan2021fedgroup, mansour2020three}.
%and variations that include defining clusters based on the distance between clients' model updates \citep{xie2021multi, briggs2020federated}, gradients \citep{duan2021fedgroup}, and training loss \citep{mansour2020three}. 
In soft clustered FL, one instead assumes that each client's data conforms to a mixture of distributions \citep{marfoq2021federated,ruan2022fedsoft, wu2023personalized}. % We extend such soft-clustered FL into a decentralized setting. 
Like these prior works, we use models learned for each cluster as guides for a personalized model; unlike them, we add a final personalization step to ensure good performance. We discuss this comparison in more detail in Section~\ref{sec:algorithms}. %\carlee{Like these prior works, we use models learned for each cluster as guides to learn a personalized model; however, unlike them, we utilize post-training finetuning to derive the personalized models from the cluster models. We discuss this comparison in more detail in Section~\ref{sec:personalization}.}
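
The soft clustering assumption above can be sketched as follows, using notation introduced here only for illustration. Suppose there are $S$ clusters with underlying distributions $\mathcal{P}_1, \ldots, \mathcal{P}_S$. Client $i$'s local data distribution $\mathcal{D}_i$ is then modeled as
\[
  \mathcal{D}_i \;=\; \sum_{s=1}^{S} u_{is}\, \mathcal{P}_s,
  \qquad u_{is} \ge 0, \quad \sum_{s=1}^{S} u_{is} = 1,
\]
where the mixture coefficients $u_{is}$ are unknown a priori. Hard clustering corresponds to the special case in which, for each client $i$, exactly one coefficient $u_{is}$ equals one and the rest are zero.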

Some prior works have considered \textbf{combining personalization and DFL}. \citet{jeong2023personalized} proposed a distillation-based algorithm, while \citet{9993756} proposed a communication-efficient algorithm with model pruning and neighbor selection. \citet{sadiev2022decentralized} prove lower bounds on personalized DFL algorithms' convergence under specific objectives. Unlike these works, we provide theoretical convergence guarantees under more general learning objectives. Some centralized personalization algorithms also include decentralized versions, such as \textbf{FedEM} \citep{marfoq2021federated} and \textbf{IFCA} \citep{ghosh2020efficient}. We experimentally show (Section~\ref{sec:simulation}) that \textbf{\algname}~outperforms both \textbf{FedEM} and \textbf{IFCA}, particularly in low-connectivity settings. Moreover, we \textit{only require each client to train one cluster model at a time}, which leads to significantly smaller computational and communication overhead than \textbf{FedEM}. % PDFL is an active research topic, more and more works are being published recently.

\textbf{Comparison with FedSoft.} \textbf{\algname}~was inspired by \textbf{FedSoft} \citep{ruan2022fedsoft}. However, the training methodology is significantly different. \textbf{FedSoft} uses a proximal objective and all client data to update its model in each round, while our \textbf{\algname}~maintains separate models for each cluster and has each client update only one of these models, using only data associated with that cluster, in each round. % keeps separate models and only uses the data associated with the same cluster to do the training. 
Thus, \textbf{\algname}~avoids bias in gradient updates, which may hamper consensus in decentralized settings. Our theoretical convergence analysis also relaxes the assumptions made by \citet{ruan2022fedsoft} in analyzing \textbf{FedSoft}. We provide a more detailed comparison in Appendix~\ref{sec:comp}.