
\section{Problem Formulation}\label{sec:formulation}

We illustrate our \textbf{system model} in Figure~\ref{fig:data} and summarize our notation in Table~\ref{tab:notation}. We consider $N$ clients connected by a graph with adjacency matrix $\mathbf{A}$, and use $\mathcal{N}_i$ to denote the set of client $i$'s neighbors. Each client $i = 1,2,\ldots,N$ has a fixed set $\mathcal{D}_i$ of training data. Clients that share an edge can communicate directly with each other, e.g., to exchange model parameters.
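
As a small illustrative example (the three-client topology is hypothetical and chosen only to make the notation concrete), consider $N = 3$ clients on a path graph in which client $2$ is connected to clients $1$ and $3$. Assuming the augmented convention of Table~\ref{tab:notation}, in which the diagonal entries of $\mathbf{A}$ equal $1$, we would have
\[
\mathbf{A} = \left[\begin{array}{ccc} 1 & 1 & 0 \\ 1 & 1 & 1 \\ 0 & 1 & 1 \end{array}\right], \qquad \mathcal{N}_1 = \{2\}, \quad \mathcal{N}_2 = \{1, 3\}, \quad \mathcal{N}_3 = \{2\}.
\]
Here clients $1$ and $3$ share no edge, so they can exchange model parameters only indirectly through client $2$. Whether $\mathcal{N}_i$ also contains $i$ itself is a matter of convention; this sketch assumes it does not.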

\begin{table*}[bht]
  \centering
\begin{tabular}{|p{4.2cm}|p{1.2cm}|p{3.2cm}|p{6.8cm}|}
\hline \textbf{Name} & \textbf{Notation} & \textbf{Domain} & \textbf{Description} \\
\hline Number of Clients / Clusters & $N, S$ & $N, S \in \mathbb{N}$ & Total number of clients / clusters \\
\hline Learning Rate & $\eta_t$ & $\eta_t \in \mathbb{R}, 0<\eta_t<1$ & Learning rate used in round $t$ \\
\hline Number of Local Updates & $\tau$ & $\tau \in \mathbb{N}$ & Number of local updates in each training round \\
\hline Client Neighbors & $\mathcal{N}_i$ & $\mathcal{N}_i \subseteq \left\{1,2,\ldots,N\right\}$ & Indices (in $\left\{1,2,\ldots,N\right\}$) of client $i$'s neighbors \\
%\hline Cluster Number & $S$ & $S \in \mathbb{N}$ & Total number of clusters \\
%\hline Number of Clients & $N$ & $N \in \mathbb{N}$ & Total number of clients \\
%\hline Number of Parameters & $X$ & $X \in \mathbb{N}$ & Total number of parameters \\
% \hline Total Rounds & $T$ & $T \in \mathbb{N}$ & Total rounds of the task \\
\hline Final Model Parameters & $\mathbf{x}_i$ & $\mathbf{x}_i \in \mathbb{R}^{1\times X}$ & Final personalized model parameters of client $i$\\
\hline Final Concatenated Model Parameters & $\mathbf{X}$ & $\mathbf{X} \in \mathbb{R}^{N\times X}$ & Concatenated personalized model parameters\\
\hline Final Phase Epochs & $\tau_{final}$ & $\tau_{final} \in \mathbb{N}$ & Number of epochs for the final phase\\
\hline Local Dataset & $\mathcal{D}_{is}^t$ & $\mathcal{D}_{is}^t \subseteq \mathcal{D}_{i}$, client $i$'s data & Data points at client $i$ associated with cluster $s$ in round $t$ \\
\hline Cluster Selection & $s_{i}^t$ & $s_{i}^t \in \left\{1,2,\ldots,S\right\}$ & Index of cluster that client $i$ trains in round $t$ \\
\hline Portion of Clusters & $u_{is}^t$ & $u_{is}^t \in \mathbb{R}, 0 < u_{is}^t \leq 1$ & Portion of client $i$'s data associated with cluster $s$ in round $t$\\
\hline Concatenated Portions of Clusters & $\mathbf{U}(t)$ & $\mathbf{U}(t) \in \mathbb{R}^{N\times S}$ & Concatenated portions of data of all clients in round $t$\\
% \hline Cluster Centers & $\mathbf{c}_{is}^t$ & $\mathbf{c}_{is}^t \in \mathbb{R}^{X}$ & Centers of cluster $s$ of client $i$ at time $t$ \\
\hline Average Cluster Centers & $\overline{\mathbf{c}}_{s}^t$ & $\overline{\mathbf{c}}_{s}^t \in \mathbb{R}^{X}$ & Average center of cluster $s$ over clients in round $t$ \\
\hline Concatenated Cluster Centers & $\mathbf{C}_s^t$ & $\mathbf{C}_s^t \in \mathbb{R}^{N\times X}$ & Concatenated centers of cluster $s$ in round $t$ \\
\hline Collection of Cluster Centers & $\mathcal{C}(t)$ & $\mathcal{C}(t) \in \mathbb{R}^{S\times N \times X}$ & $\mathcal{C}(t) = \{ \mathbf{C}_1^t, \mathbf{C}_2^t, \ldots, \mathbf{C}_S^t \}$ \\
\hline Weight Matrix & $\mathbf{W}_s^t$ & $\mathbf{W}_s^t \in \mathbb{R}^{N\times N}$ & Weight matrix of cluster $s$ in round $t$ \\
\hline Augmented Adjacency Matrix & $\mathbf{A}$ & $\mathbf{A} \in \mathbb{R}^{N\times N}$ & Augmented adjacency matrix with diagonal elements equal to 1 \\
\hline Concatenated Gradients & $\mathbf{G}_s^t$ & $\mathbf{G}_s^t \in \mathbb{R}^{N\times X}$ & Concatenated gradients in round $t$ for cluster $s$, $\mathbf{G}_s^t := [\nabla F_1, \ldots, \nabla F_N]$\\
\hline
\end{tabular}
\caption{\sl Mathematical notation used in the paper.}
  \label{tab:notation}
\end{table*}

\begin{figure}[t]
\centering
    \includegraphics[width=0.38\textwidth]{Styles/Fig/Data.png}  
\caption{\sl Illustration of the mixture of data distributions at clients in DFL.}
\label{fig:data}
\end{figure}

Each data point $d\in \mathcal{D}_i$ on each client $i$ is randomly sampled from one of $S$ distinct probability distributions (clusters), denoted $P_1, P_2, \ldots, P_S$, as illustrated in Figure~\ref{fig:data}. Consistent with standard clustering methods, we take $S$ as a predetermined hyperparameter \citep{ruan2022fedsoft}.
Letting $\mathbf{x}$ denote the parameters of a machine learning model, we define the loss function $\ell (\mathbf{x}; \mathcal{D})$ as the sum of the losses of the model with parameters $\mathbf{x}$ over all data points $d$ in a dataset $\mathcal{D}$. Cross-entropy loss, for example, is a typical loss function for classification problems. The \textit{risk} of cluster $s$ can then be written as:
%
%\begin{equation}
    $F_s(\mathbf{x})=\mathbb{E}_{\mathcal{D}\sim P_s}[\ell (\mathbf{x}; \mathcal{D})]$.
%\end{equation}
%
For each client $i$, the risk on a data point $d_{is}$ belonging to cluster $s$ is defined as $f_{is}(\mathbf{x}, d_{is}) = \ell (\mathbf{x}; d_{is})$. Our goal is for the clients to collectively find the optimal (i.e., risk-minimizing) model parameters for each cluster, which we call the \textit{cluster centers} and which can be written as:
%\begin{equation}
$\mathbf{c}_s^*=\arg\min_{\mathbf{x}}F_s(\mathbf{x}), \text{ for } s=1, 2, \ldots, S$.
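
To make the link between these definitions concrete, the following sketch spells out the sum form of the loss and one possible empirical counterpart of $F_s$. The symbol $\widehat{F}_{is}$ and the normalization by $|\mathcal{D}_{is}|$ are illustrative assumptions rather than notation used elsewhere in the paper, and the subset $\mathcal{D}_{is} \subseteq \mathcal{D}_i$ of client $i$'s data points drawn from $P_s$ is treated as known purely for illustration:
\[
\ell(\mathbf{x}; \mathcal{D}) = \sum_{d \in \mathcal{D}} \ell(\mathbf{x}; d), \qquad
\widehat{F}_{is}(\mathbf{x}) = \frac{1}{|\mathcal{D}_{is}|}\,\ell(\mathbf{x}; \mathcal{D}_{is}) = \frac{1}{|\mathcal{D}_{is}|} \sum_{d \in \mathcal{D}_{is}} \ell(\mathbf{x}; d).
\]
Under this reading, minimizing $\widehat{F}_{is}$ over $\mathbf{x}$ would give client $i$ a local estimate of the cluster center $\mathbf{c}_s^*$.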
%\end{equation}
%\subsection{Objective}
%Given knowledge of the cluster centers and mixture coefficients $u_{is}$, defined as the portion of the data from cluster $s$ at the $i$-th client, each client can then find a personalized model for its local data mixture, as we explain in the next section. By focusing on clients' finding (common) cluster centers instead of exclusively focusing on finding personalized models, we can still view the problem of personalized learning as one of finding consensus on the cluster centers, solving one of the major challenges of personalized decentralized FL. However, clients cannot directly solve for the cluster centers using their local training data $\mathcal{D}_i$, as their data comes from a \textit{mixture} of the clusters, and they do not initially know which data points have come from which cluster. Thus, in the next section we present an algorithm for clients to estimate the cluster centers and corresponding personalized models.

% Given knowledge of the cluster centers and mixture coefficients \( u_{is} \), which represent the portion of data from cluster \( s \) at client \( i \), each client can find a personalized model for its local data mixture, as explained in the next section. By focusing on finding common cluster centers, we can still view personalized learning as finding consensus on the cluster centers, addressing a major challenge in personalized decentralized FL. However, clients cannot directly solve for the cluster centers using their local training data \(\mathcal{D}_i\), as their data is a \textit{mixture} of clusters, and they do not initially know which data points belong to which cluster. Thus, in the next section, we present an algorithm for clients to estimate the cluster centers and use these centers to derive personalized models.
Given the cluster centers and the mixture coefficients \( u_{is} \), which represent the proportion of client \( i \)'s data drawn from each cluster \( s \), each client can find a personalized model for its local data mixture (Section~\ref{sec:algorithms}). By focusing on common cluster centers, personalized learning can be reframed as achieving consensus on these centers, addressing a key challenge in personalized DFL. However, clients cannot directly determine the cluster centers from their local data \(\mathcal{D}_i\), since it is a \textit{mixture} of clusters and they do not know which of their data points come from which cluster. In the next section, we present an algorithm for clients to estimate the cluster centers and use them to derive personalized models.
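
As a hypothetical numerical illustration, suppose the mixture coefficient is read as the fraction of client $i$'s data associated with cluster $s$ (an interpretation of the description in Table~\ref{tab:notation}, with the round index $t$ dropped for brevity):
\[
u_{is} = \frac{|\mathcal{D}_{is}|}{|\mathcal{D}_i|}, \qquad \sum_{s=1}^{S} u_{is} = 1 \ \text{ when the sets } \mathcal{D}_{i1}, \ldots, \mathcal{D}_{iS} \text{ partition } \mathcal{D}_i.
\]
For instance, with $S = 2$ and $|\mathcal{D}_i| = 100$ data points, of which $60$ come from $P_1$ and $40$ from $P_2$, we would have $u_{i1} = 0.6$ and $u_{i2} = 0.4$.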
% We thus define $u_{is}$ as the portion of the data from cluster $s$ at the $i$-th client. The \textbf{objective function} of each client $i$ at time $t$ is then:
% \begin{equation}
%     h_{i}(t) = f_{i}(\mathbf{x}_i^t; \mathcal{D}_{i, s}^t).
% \end{equation}
% Here $\mathbf{x}_i^t = \mathbf{c}_{i, s}^{t}$, is the cluster center of cluster $s$ that client $i$ keeps at time $t$. (This indicates that the cluster centers each client kept can be different.) $s$ is the selected cluster and $f_i$ is the risk for each client, which can be decomposed as combination of different cluster risks.
% \carlee{Is this still true after you changed the algorithm?}
% \subsection{Parameters}
%\subsection{Mathematical Notations}
%The mathematical notations used throughout this paper is listed in Table \ref{tab:notation}.

