\subsection{Experiment Details}
The following shows the detailed settings of our experiments. \new{We largely follow \cite{marfoq2021federated, ruan2022fedsoft} in our experiment settings.} %\carlee{We largely follow [cite FedEM, other relevant papers] in our experiment settings.}

\subsubsection{MNIST/EMNIST Data}
%Half of the data were selected to do a 90 degrees rotation. Each client get the same number of data but the portion of the rotated data and non-rotated data is randomly ranging from 10\% to 90\% respectively. The number of clients is set to be $N=100$ in the comparison with baselines. We use a CNN model comprising of two convolutional layers. Their kernel size and padding are equal to 5 and 2 respectively. Each convolutional layer is followed by the max-pooling layer with kernel size equal to 2. After the convolutional layers, it followed by fully-connected layers with dropout of size 50. The activation function is ReLU. All clients use SGD as solver. The number of local epochs is equal to 5, while the initial step have doubled local epochs. The initial learning rate is set to 5e-2 with decaying factor being 0.80. The training lasts for 150 global epochs.
Half of the dataset was selected to undergo a 90-degree rotation. Each client received the same amount of data, but the proportion of rotated data was drawn uniformly at random from 10\% to 90\%. The number of clients was fixed at $N = 100$ for comparison with the baselines. A CNN (convolutional neural network) model was employed, consisting of two convolutional layers with kernel size and padding set to 5 and 2, respectively. Each convolutional layer was followed by a max-pooling layer with a kernel size of 2. After the convolutional layers, fully connected layers were used, with a dropout layer of size 50. The ReLU activation function was applied \new{to each convolutional layer and fully connected layer.} %\carlee{to each fully connected layer?}. 
All clients utilized SGD as the optimizer. The number of local epochs was set to 5, with the initial step having double the local epochs \new{to accelerate the initial learning, leading to a faster reduction in global loss.} %\carlee{This sounds unusual; can we give a short justification?}. 
The initial learning rate was 5e-2, with a decay factor of 0.80. Training was carried out over 150 global epochs. Regarding network topology, unless otherwise specified, an Erdős--Rényi (ER) random graph with connection probability $p=0.06$ and $N=100$ clients was used in the experiments. The results in the table are averaged over five independent runs.

\subsubsection{CIFAR-10 \& CIFAR-100 Data}
% Data are divided into even and odd labels. We also randomly select half of the data being rotated by 90 degrees. This potentially create 4 different distributions of data. Each client get the same number of data but the portion of the odd-labeled data and even-labeled data is randomly ranging from 10\% to 90\% respectively. The number of clients is set to be $N=25$ in the comparison with baselines. We use a CNN model comprising of four convolutional layers. The first two layers with kernel size and padding equal to 5 and 2 respectively. The last two layers with kernel size and padding equal to 3 and 1 respectively. Each convolutional layer is followed by batch normalization. After the second and fourth convolutional layers, it followed by max-pooling layer with kernel size equal to 2 and the dropout layer. After the convolutional layers, it followed by two fully-connected layers with dropout and batch normalization with 1024 and 512 hidden neurons respectively. The activation function is ReLU. All clients use SGD as solver. The number of local epochs is equal to 5, while the initial step have doubled local epochs. The initial learning rate is set to be 5e-2 with decaying factor being 0.85. The training lasts for 150 global epochs.
The dataset was divided into even and odd labels \new{according to the parity of each class's numeric label in the dataset}%\carlee{how did you determine which labels were ``even'' and which ``odd''?}
, and half of the data was randomly selected to undergo a 90-degree rotation. This process potentially created four different data distributions (rotated even, un-rotated even, rotated odd, un-rotated odd). Each client received an equal amount of data, but the proportion of odd-labeled versus even-labeled data was drawn uniformly at random from 10\% to 90\%. The number of clients was set to $N = 25$ for comparison with the baselines. A CNN model with four convolutional layers was used. The first two layers had a kernel size and padding of 5 and 2, respectively, while the last two layers had a kernel size and padding of 3 and 1, respectively. Each convolutional layer was followed by batch normalization. After the second and fourth convolutional layers, max-pooling with a kernel size of 2 and a dropout layer were applied. Following the convolutional layers, two fully connected layers with dropout and batch normalization were used, containing 1024 and 512 hidden neurons, respectively. The activation function %\carlee{on each layer?} 
was ReLU \new{on each layer.} All clients used SGD as the optimizer. The number of local epochs was set to 5, with the initial step doubling the local epochs. The initial learning rate was set to 5e-2, with a decay factor of 0.85. Training was conducted for 150 global epochs. Regarding network topology, unless otherwise specified, an Erdős--Rényi (ER) random graph with connection probability $p=0.20$ and $N=25$ clients was used in the experiments. The results in the table are averaged over two independent runs, except for FedSoft, which was run only once due to its extensive runtime.
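
To give a sense of how sparse these default topologies are, recall that a node in an ER random graph with $N$ nodes and connection probability $p$ has expected degree $p(N-1)$. The symbol $\bar{d}$ below is introduced only for this illustration. For the MNIST/EMNIST and CIFAR defaults stated above, this gives
\[
\bar{d}_{\text{MNIST}} = 0.06 \times (100 - 1) = 5.94,
\qquad
\bar{d}_{\text{CIFAR}} = 0.20 \times (25 - 1) = 4.8,
\]
so in both settings each client exchanges models with roughly five to six neighbors on average, even though the connection probabilities differ.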

% For the code for FedSPD, please refer to the \href{https://github.com/Anonymous-Submission-for-AISTATS/FedSPD_AISTATS2025}{Link}.

For the code for FedSPD, please refer to the \href{https://github.com/Anonymous-Submission-for-AISTATS/FedSPD_Anonymous_Submission}{Link}.

\subsection{Additional Simulation Results}
\subsubsection{Varying the Number of Local Epochs} \label{sec:fl_localep}
%We ran the experiments for 150 epochs on MNIST, CIFAR-10 and CIFAR-100 datasets. As shown in Figure \ref{fig:TA_LE}, a larger number of local epochs results in faster convergence. For \(\tau=1\), the training does not converge even after 150 epochs for MNIST. And for CIFAR-10 and CIFAR-100,  it seems to converge to a lower training accuracy. We can observe that when the dataset and model become more complicated, increasing local epochs may help to increase the performance. Table \ref{tab:tau} shows the final testing accuracies of different number of local epochs under different datasets. For MNIST, the testing accuracies are 93.27\% and 93.47\%, respectively, showing only a slight difference. This may due to the MNIST dataset being too simple. In CIFAR-10, the testing accuracy of (\(\tau=5\) and \(\tau=10\)) are 70.61\% and 66.52\%, respectively, where larger number of local epochs in this case actually lower the final performance. However, in CIFAR-100, \(\tau=10\) is actually the case with the best performance. We can observe that when the data become more complicated, a larger number of local epochs may be a better choice, which is the same conclusion from looking at the training accuracy curves. However, it is important to note that setting \(\tau\) too high is not advisable, as it may lead to overfitting to the local data, which is the case in \(\tau=10\) for the CIFAR-10 dataset.
We conducted experiments for 150 epochs on the MNIST, CIFAR-10, and CIFAR-100 datasets. As shown in Figure \ref{fig:TA_LE}, increasing the number of local epochs in \textbf{\algname} leads to faster convergence. For \(\tau=1\), the training did not converge even after 150 epochs on MNIST, and for CIFAR-10 and CIFAR-100, it seemed to converge to a lower training accuracy. We observed that as the dataset and model complexity increased, increasing the number of local epochs tended to improve performance. 

Table \ref{tab:tau} presents the final \textbf{\algname} testing accuracies for different numbers of local epochs across the datasets. On MNIST, the testing accuracies for \(\tau=5\) and \(\tau=10\) were 93.27\% and 93.47\%, respectively, showing only a slight difference, likely because the MNIST dataset is relatively simple, so the learning hyperparameters do not make much of a difference in model performance. For CIFAR-10, the testing accuracies for \(\tau=5\) and \(\tau=10\) were 70.61\% and 66.52\%, respectively, where a larger number of local epochs actually reduced the final performance. However, for CIFAR-100, \(\tau=10\) resulted in the best performance. This suggests that for more complex datasets, a higher number of local epochs can be beneficial, as indicated by the training accuracy curves. Nevertheless, it is important to note that setting \(\tau\) too high may lead to overfitting to the local data, as was the case with \(\tau=10\) on the CIFAR-10 dataset. \new{These findings are consistent with known results in general federated learning, where a higher number of local epochs can effectively increase the number of gradient steps taken, accelerating convergence as long as the local models do not diverge too much due to a large number of local steps.}

\begin{figure*}[h!]
\begin{subfigure}{0.32\linewidth}
    \centering
    \includegraphics[width=1.0\linewidth]{Styles/Fig/TA_MN.png}
    \caption{\sl Training accuracy on MNIST.}
    \label{fig:TA_MN}
\end{subfigure}
\hfill
\begin{subfigure}{0.32\textwidth}
    \centering
    \includegraphics[width=1.0\linewidth]{Styles/Fig/TA_CF10B.png}
    \caption{\sl Training accuracy on CIFAR-10.}
    \label{fig:TA_C10}
\end{subfigure}
\hfill
\begin{subfigure}{0.32\textwidth}
    \centering
    \includegraphics[width=1.0\linewidth]{Styles/Fig/TA_CF100B.png}
    \caption{\sl Training accuracy on CIFAR-100.}
    \label{fig:TA_C100}
\end{subfigure}
\caption{\sl \textbf{\algname} training accuracy with different numbers of local steps $\tau$. When the data become more complicated, increasing local epochs may be a better choice.}
\label{fig:TA_LE}
\end{figure*}

\begin{table}[h!]
  \centering
\begin{tabular}
{|p{3.0cm}|p{1.2cm}|p{1.2cm}|p{1.2cm}|}
\hline  Local Epochs & 1 & 5 & 10 \\
\hline MNIST & 74.20 & 93.27 & 93.47 \\
\hline CIFAR-10 & 41.34 & 70.61 & 66.52 \\
\hline CIFAR-100 & 19.86 & 43.35 & 44.99 \\
\hline
\end{tabular}
\caption{\sl Final \textbf{\algname} testing accuracies for different numbers of local epochs.}
  \label{tab:tau}
\end{table}

\subsubsection{Influence of the Final Phase} \label{sec:fl_final}
% Our \algname~ consists of the final phase after the typical training. The optimal number of epochs required for this final phase varies with different datasets and learning model. Due to the simplicity of EMNIST and its model, the testing accuracy is high enough after aggregation. In our settings of EMNIST, around 10 epochs for the final phase can increase the performance by 0.5\% and after 10 epochs the testing accuracy remains at the same level. For CIFAR-10 and CIFAR-100, the testing accuracy increase by 7\% and 6\% respectively after 15 epochs. Around 30 epochs is good enough to reach an optimal performance for both datasets. It is important to note that setting the correct number of epochs and the learning rate of this final phase is important. Too many epochs and learning rate that is too high (or the decay of the learning rate is too low) are not advisable, as it may lead to overfitting to the local data. Keep in mind that this final phase is being trained locally instead, thus not using any communication resources, which is an advantage of our \textit{\algname}~algorithms. Note that for EMNIST, CIFAR-10 and CIFAR100, our \textit{\algname}~already has higher accuracies compare to other methods without this final phase. Other algorithms like FedEM, already do the aggregation during the regular training phase. Adding extra local rounds may cause overfitting.
Our \textbf{\algname}~algorithm uses a final phase that follows the typical federated learning training process. The optimal number of epochs for this final phase varies depending on the dataset and learning model. Due to the simplicity of EMNIST and its model, the testing accuracy is already sufficiently high after aggregation. In our EMNIST setup, using %approximately \carlee{why approximately?} 
10 epochs in the final phase increases performance by 0.5\%, and beyond 10 epochs, the testing accuracy stabilizes. For CIFAR-10 and CIFAR-100, the testing accuracy improves by 7\% and 6\%, respectively, after 15 epochs. Around 30 epochs are sufficient to achieve optimal performance for both datasets. It is important to note that choosing the correct number of epochs and learning rate for this final phase is crucial. Too many epochs, or a learning rate that is too high (or with insufficient decay), may lead to overfitting to the local data. Since this final phase is trained locally without any communication overhead, it presents a key advantage of our \textbf{\algname}~algorithm \new{in communication-constrained settings}. Additionally, note that for EMNIST, CIFAR-10, and CIFAR-100, our \textbf{\algname}~already achieves higher accuracies compared to other methods, even without this final phase. Other algorithms like \textbf{FedEM} perform aggregation during the regular training phase, so adding extra local rounds in a final phase of training may lead to overfitting.


\begin{figure*}[h!]
\begin{subfigure}{0.32\textwidth}
    \centering
    \includegraphics[width=1.0\linewidth]{Styles/Fig/TT_EM.png}
    \caption{\sl Testing accuracy on EMNIST.}
    \label{fig:TT_EM}
\end{subfigure}
\hfill
\begin{subfigure}{0.32\textwidth}
    \centering
    \includegraphics[width=1.0\linewidth]{Styles/Fig/TT_C10.png}
    \caption{\sl Testing accuracy on CIFAR-10.}
    \label{fig:TT_C10}
\end{subfigure}
\hfill
\begin{subfigure}{0.32\textwidth}
    \centering
    \includegraphics[width=1.0\linewidth]{Styles/Fig/TT_C100.png}
    \caption{\sl Testing accuracy on CIFAR-100.}
    \label{fig:TT_C100}
\end{subfigure}
\caption{\sl Testing accuracy of the final phase.}
\label{fig:TT_FP}
\end{figure*}

\subsubsection{Influence of the Number of Clusters} \label{sec:fl_cluster}
%The testing accuracy with different hyperparameter $S$ (number of clusters) for CIFAR-10 and CIFAR-100 datasets is shown in Figure \ref{fig:cns}. In the experiment settings, we potentially creating 4 different distributions with different labels and image rotation. In our \algname, setting the $S$ being too high may not increase the performance. This may due to most practical loss functions, such as cross-entropy for neural networks, are not convex, this aggregated model need not perform optimally in practice. Aggregating more models in the final phase may exacerbate this problem. However, in our \algname~algorithms, setting $S=2$ already have a great performance in terms of he final test accuracy.
The testing accuracy for different values of the hyperparameter $S$ (number of clusters) on the CIFAR-10 and CIFAR-100 datasets is shown in Figure \ref{fig:cns}. In the experimental settings, we potentially created four different distributions by using varying labels and image rotations. In our \textbf{\algname}~algorithm, setting $S$ too high does not necessarily improve performance. This may be because most practical loss functions, such as the cross-entropy used in neural networks, are non-convex, meaning that the aggregated model may not perform optimally in practice. Aggregating more models in the final phase can exacerbate this issue. However, in our \textbf{\algname}~algorithm, setting $S=2$ already gives excellent performance in terms of the final test accuracy.

\begin{figure}
    \centering
    \includegraphics[width=0.5\linewidth]{Styles/Fig/CLS.png}
    \caption{\textbf{\algname} test accuracy for different numbers of clusters $S$.}
    \label{fig:cns}
\end{figure}

\subsubsection{Extra Details for Experiments with Different Graph Connectivity} \label{sec:fl_edcon}
%The training accuracy versus epochs for MNIST on each topology is shown in Figure \ref{fig:topology}. We can observe that usually, the networks with lower connectivity converge slower than the ones with higher connectivity. We also observe that RGG has more oscillations compared to other topologies which may due to the high clustering effect \citep{penrose2003random}. However, eventually they reach the same level of training accuracy no matter which kind of network topology we use, indicating that, as we expect from Theorem~\ref{thm:4}, \textit{\algname}~converges regardless of the network topology as long it is connected.
\textbf{\algname}'s training accuracy versus epochs for MNIST across different topologies is shown in Figure \ref{fig:topology}. We observe that networks with lower connectivity typically converge more slowly than those with higher connectivity, in each topology. Additionally, RGG exhibits more oscillations compared to other topologies, likely due to its high clustering effect \citep{penrose2003random}. However, all topologies eventually reach the same level of training accuracy, regardless of the network structure, indicating that, as predicted by Theorem~\ref{thm:4}, \textbf{\algname}~converges as long as the network is connected.


\begin{figure*}[h!]
\begin{subfigure}{0.32\textwidth}
    \centering
    \includegraphics[width=1.0\linewidth]{Styles/Fig/ER.jpg}
    \caption{\sl Training accuracy of ER Graph.}
    \label{fig:ER}
\end{subfigure}
\hfill
\begin{subfigure}{0.32\textwidth}
    \centering
    \includegraphics[width=1.0\linewidth]{Styles/Fig/BA.png}
    \caption{\sl Training accuracy of BA Model.}
    \label{fig:BA}
\end{subfigure}
\hfill
\begin{subfigure}{0.32\textwidth}
    \centering
    \includegraphics[width=1.0\linewidth]{Styles/Fig/RGG.png}
    \caption{\sl Training accuracy of RGG.}
    \label{fig:RGG}
\end{subfigure}
\caption{\sl \textbf{\algname}~converges slightly faster on networks of higher average degree, with noisier convergence on highly clustered RGG graphs, on MNIST Data.}
\label{fig:topology}
\end{figure*}

In addition to static network topology settings, we evaluate the performance of our \textbf{\algname} algorithm under dynamic network conditions. In the following experiments, we use CIFAR-100 with 25 clients, initializing the network with an Erdős--Rényi (ER) random graph. To simulate a dynamic network, at each epoch, every existing edge has a probability $p$ of being removed, while each non-existent edge has a probability $p_{\text{add}}$ of being added. The value of $p_{\text{add}}$ is adjusted at each epoch to maintain a roughly constant average connectivity across the network. A larger value of $p$ corresponds to a more dynamic network topology over time. The results are summarized in Table \ref{tab:dynet}. From the results, we observe that network dynamics have little effect on performance---our \textbf{\algname} consistently maintains its effectiveness across different edge removal probabilities $p$.
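
The update rule for $p_{\text{add}}$ is only described qualitatively above. As an illustrative sketch, and not necessarily the exact rule used in our implementation, one balance condition consistent with this description equates the expected number of removed and added edges. Let $E_t$ denote the edge set at epoch $t$ and $M = N(N-1)/2$ the number of possible edges (both symbols are introduced here only for this sketch). Then
\[
p\,|E_t| = p_{\text{add}}\,\bigl(M - |E_t|\bigr)
\quad\Longrightarrow\quad
p_{\text{add}} = \frac{p\,|E_t|}{M - |E_t|}.
\]
For example, with $N = 25$ we have $M = 300$; assuming for illustration that the graph currently has $|E_t| = 60$ edges and $p = 0.3$, this yields $p_{\text{add}} = 0.3 \times 60 / 240 = 0.075$.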

\begin{table}[h!]
  \centering
\begin{tabular}
{|p{2.4cm}|p{1.2cm}|p{1.2cm}|p{1.2cm}|p{1.6cm}|}
\hline  $p$ & 0.3 & 0.2 & 0.1 & 0 (Static) \\
\hline Test Accuracy & 37.56 & 37.22 & 37.42 & 37.14 \\
\hline
\end{tabular}
\caption{\sl Performance of Dynamic Network Topology.}
  \label{tab:dynet}
\end{table}


\subsubsection{Impact of Data Quantity Imbalance Across Clients}\label{sec:IDA}

In addition to considering data imbalance across clusters, we further evaluate the performance of our method under a more challenging setting where both inter-cluster imbalance and total data imbalance across clients are present. Specifically, we conduct experiments using the CIFAR-100 dataset with the same configuration as described in Figure~\ref{fig:CFCN}. To simulate varying amounts of data per client, we categorize clients into three groups: low, average, and high data holders. Let $r$ denote the ratio of data volume between clients with the highest and lowest data quantities.

\begin{figure*}[h!]
\centering
\begin{subfigure}{0.49\textwidth}
    \centering
    \includegraphics[width=1.0\linewidth]{Styles/Fig/UDNA.png}
    \caption{\sl Average test accuracy with unbalanced data amount.}
    \label{fig:UDNA}
\end{subfigure}
\begin{subfigure}{0.49\textwidth}
    \centering
    \includegraphics[width=1.0\linewidth]{Styles/Fig/UDNB.png}
    \caption{\sl Box plot of test accuracy across clients.}
    \label{fig:UDNB}
\end{subfigure}
\caption{\sl Test accuracy with unbalanced data amount.}
\label{fig:UDN}
\end{figure*}

The results of this experiment are presented in Figure~\ref{fig:UDN}. We observe that the average accuracy remains stable as the imbalance ratio $r$ increases. Notably, even under the most skewed setting ($r=9$), the clients with the lowest test accuracy achieve approximately $30\%$ accuracy—substantially higher than the performance of local training under uniform data allocation, which yields only around $14\%$ accuracy. This demonstrates that clients with limited data can significantly benefit from collaborative training and knowledge sharing with other clients.

\subsubsection{Experiments Incorporating Differential Privacy (DP)}\label{sec:dp}

Following \cite{wei2020federated}, we conduct experiments on the MNIST dataset with 50 clients. The DP parameters are selected as follows: clipping threshold $C = 1$ and $\delta = 0.01$, so that $c$ is chosen to be $\sqrt{2 \ln \frac{1.25}{0.01}}$. We run the experiments with $\epsilon$ set to 10, 50, and 100. Table \ref{tab:dp} reports two accuracies for each setting. The first is the test accuracy of \textbf{\algname} immediately after model aggregation. The second is the accuracy after 10 local epochs of our final phase. We report both because the final phase consists of purely local training and therefore does not require DP; if we showed only the accuracy after the final phase, the influence of DP might not be clear.
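
For reference, with $\delta = 0.01$ the constant $c$ evaluates numerically to
\[
c = \sqrt{2 \ln \frac{1.25}{0.01}} = \sqrt{2 \ln 125} \approx 3.11.
\]
In the standard Gaussian mechanism, the noise standard deviation is proportional to $c\,\Delta/\epsilon$, where $\Delta$ denotes the sensitivity of the clipped update (a symbol introduced here only for illustration; the exact sensitivity expression follows \cite{wei2020federated}). Under this scaling, moving from $\epsilon = 100$ to $\epsilon = 10$ increases the noise standard deviation by a factor of ten.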

\begin{table}[h!]
  \centering
\begin{tabular}
{|p{5.0cm}|p{2cm}|p{2cm}|p{2cm}|p{2cm}|}
\hline  Metrics & No DP & DP ($\epsilon=100$) & DP ($\epsilon=50$) & DP ($\epsilon=10$) \\
\hline Test Accuracy (Post Aggregation) & 92.51 & 92.75 & 92.46 & 92.13 \\
\hline Test Accuracy (After Final Phase) & 93.89 & 93.99 & 93.87 & 93.70 \\
\hline
\end{tabular}
\caption{\sl Results with DP on MNIST dataset.}
  \label{tab:dp}
\end{table}

From the results, we see that our \textbf{\algname} algorithm combines well with DP: the accuracies remain high across the different settings. The $\epsilon=100$ case even achieves slightly higher accuracy than the case without DP. This may be because moderate additive noise helps prevent overfitting to some extent. We also observe that the final phase of our algorithm improves the model and reduces the gap in test accuracy across the different settings. This is further evidence that the final phase of our \textbf{\algname} effectively personalizes the local models.

\subsubsection{Experiments Using MobileNet-v2}\label{sec:mobilenet}
We conducted experiments on CIFAR-100, CIFAR-10, and mixtures of CIFAR-10 with MNIST and FashionMNIST, using MobileNet-v2 as the machine learning model. The mixed datasets were created by sampling 25,000 data points from CIFAR-10 and 25,000 from MNIST/FashionMNIST. Each client randomly drew between 10\% and 90\% of its data from one of the sampled datasets, with the remainder sourced from the other. The network topology was modeled as an Erdős–Rényi (ER) random graph with a connection probability of $p = 0.20$ and a total of 20 clients. The results are presented in Table~\ref{tab:mb}.

For CIFAR-100, our method outperforms the other methods. However, in the CIFAR-10 and data mixture settings, the results indicate that as the model size and complexity increase, FedAvg outperforms all other algorithms. As models and datasets become more complex, personalization methods may occasionally exhibit reduced performance. One possible explanation is the model's expressiveness, which allows a single global model to effectively capture variations across different clients. In this case, the global model is sufficiently robust and more effective than personalizing to local data distributions. Similar results were also observed for the decentralized FedEM \cite{marfoq2021federated}, where performance degradation occurred under specific conditions in their experiments. Specifically, their decentralized FedEM performs worse than FedAvg on FEMNIST and CIFAR-10.

Our proposed method, \textbf{\algname}, faces greater challenges in such scenarios, as it splits the local dataset into two clusters for separate training and postpones the aggregation. However, when the distributions of the two clusters differ significantly, such as in the mixture of CIFAR-10 with MNIST or FashionMNIST, \textbf{\algname} achieves faster convergence, as demonstrated in Figures~\ref{fig:mb-mm} and~\ref{fig:mb-mf}. This highlights \textbf{\algname}'s ability to accurately distinguish data sampled from different distributions, showing that \textbf{\algname} can be trained efficiently when the communication/computing resources are limited.

\begin{table}[htbp]
  \centering
\begin{tabular}
{|p{4.5cm}|p{1.5cm}|p{1.5cm}|p{1.5cm}|p{1.5cm}|p{1.5cm}|p{1.5cm}|}
\hline   & \textbf{FedSPD} & FedEM & IFCA & pFedMe & FedAvg & Local \\
\hline CIFAR-10 & 72.14 & 78.02 & 78.56 & 61.00 & 79.01 & 57.72\\
\hline CIFAR-10 + MNIST & 86.20 & 88.88 & 89.08 & 74.64 & 89.27 & 77.76\\
\hline CIFAR-10 + FashionMNIST & 80.04 & 85.01 & 85.47 & 69.37 & 85.58 & 72.74\\
\hline CIFAR-100 & 46.13 & 42.29 & 44.48 & 32.77 & 45.70 & 18.52\\
\hline
\end{tabular}
\caption{\sl Results using MobileNet-v2.}
  \label{tab:mb}
\end{table}

\begin{figure*}[h!]
\begin{subfigure}{0.32\textwidth}
    \centering
    \includegraphics[width=1.0\linewidth]{Styles/Fig/MCF.png}
    \caption{\sl Training Accuracy of Single CIFAR-10 Dataset.}
    \label{fig:mb-cf}
\end{subfigure}
\hfill
\begin{subfigure}{0.32\textwidth}
    \centering
    \includegraphics[width=1.0\linewidth]{Styles/Fig/MMM.png}
    \caption{\sl Training Accuracy of Mixture of CIFAR-10 + MNIST.}
    \label{fig:mb-mm}
\end{subfigure}
\hfill
\begin{subfigure}{0.32\textwidth}
    \centering
    \includegraphics[width=1.0\linewidth]{Styles/Fig/MMF.png}
    \caption{\sl Training Accuracy of Mixture of CIFAR-10 + FashionMNIST.}
    \label{fig:mb-mf}
\end{subfigure}
\caption{\sl Experiments on Various Datasets Using MobileNet-v2.}
\label{fig:mobile}
\end{figure*}
