\section{Numerical Experiments}\label{sec:num_exp}
We discuss numerical experiments that examine the performance of our algorithms and compare them to state-of-the-art (SOTA) baselines. Unless otherwise stated, all reported numbers are averages over 50 repetitions with randomized data shuffling, and all error bars or confidence intervals represent one standard deviation. We use equal client weights $\alpha_g$ and the same local hyperparameters for all clients. Additional experimental details and a \textbf{scalability experiment} are provided in Appendix \ref{app:exp}. All code is available at \href{https://github.com/mibrahim41/FDR-SVM}{this link}.


\textbf{Our Methods.} Our algorithms include: i) \textit{SM}: the FDR-SVM model trained via the SM algorithm in Algorithm \ref{alg:subgrad} with a diminishing step-size, ii) \textit{ADMM}: the FDR-SVM model trained via the ADMM algorithm in Algorithm \ref{alg:admm}, and iii) \textit{ADMM-SC}: the FDR-SVM model with modified client objectives according to Theorem \ref{thm:admm_conv_cri}, trained via the ADMM-SC algorithm in Algorithm \ref{alg:admm}.

\subsection{UCI Data Experiment}\label{sec:real}
This experiment compares the performance of our methods to various SOTA benchmarks. We use $G=4$ clients for all datasets. Performance is measured in terms of F-1 score. 
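
For reference, we take the F-1 score in its standard binary form (an assumption made for this illustration, since the definition is not restated in this section). With $\mathrm{TP}$, $\mathrm{FP}$, and $\mathrm{FN}$ denoting the true positive, false positive, and false negative counts on the test set,
\[
F_1 = \frac{2\,\mathrm{TP}}{2\,\mathrm{TP} + \mathrm{FP} + \mathrm{FN}},
\]
which equals the harmonic mean of precision and recall. For instance, the illustrative counts $\mathrm{TP}=90$, $\mathrm{FP}=10$, and $\mathrm{FN}=20$ (not taken from our experiments) give $F_1 = 180/210 \approx .857$.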

\textbf{Datasets.} We utilize 7 popular datasets from the UCI repository. For all datasets, $70\%$ of the samples are used for training and the remainder is used for testing. 

\textbf{Baselines.} We use the DR-SVM model by \cite{2019regularization} as a centralized benchmark model. For federated baselines, we compare to the popular \texttt{FedSGD}, \texttt{FedAvg} \citep{mcmahan2017}, and \texttt{FedProx} \citep{li2020}, each used to train an $\ell_2$-squared regularized SVM. We also compare to \texttt{FedDRO} \citep{Khanduri2023}, used to train a DR-SVM with a KL divergence ambiguity set.
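
As a point of reference, a generic $\ell_2$-squared regularized soft-margin SVM can be written as follows. This is a sketch of the standard formulation and not necessarily the exact objective used in our implementation; the symbols $\mathbf{w}$ and $\lambda$, the sample notation $(\mathbf{x}_i, y_i)$ with $y_i \in \{-1,+1\}$, and the omission of a bias term are assumptions introduced for illustration:
\[
\min_{\mathbf{w}} \; \frac{1}{N}\sum_{i=1}^{N} \max\bigl\{0,\, 1 - y_i \mathbf{w}^\top \mathbf{x}_i\bigr\} + \frac{\lambda}{2}\|\mathbf{w}\|_2^2 .
\]
The hinge term is non-smooth at $y_i \mathbf{w}^\top \mathbf{x}_i = 1$, so gradient-based federated methods applied to this objective rely on subgradients.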

\textbf{Hyperparameters.} We tune the Wasserstein radius $\varepsilon$ and label-flipping cost $\kappa$ for the centralized baseline, and the initial learning rate $\gamma(0)$ and number of rounds $T$ for all federated baselines. We also tune the number of rounds $T$ and the hyperparameters $\rho$ and $\gamma$ for our methods. We utilize 5-fold cross-validation for hyperparameter tuning.
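
The cross-validation criterion can be sketched as follows, under the assumption (not stated above) that the selection metric is the validation F-1 score. Let $\theta$ denote a candidate hyperparameter configuration and let the training set be partitioned into folds $\mathcal{D}_1,\dots,\mathcal{D}_5$; both symbols are introduced here for illustration only. Then
\[
\mathrm{CV}(\theta) = \frac{1}{5}\sum_{k=1}^{5} F_1\bigl(h_{\theta}^{(-k)};\, \mathcal{D}_k\bigr), \qquad \hat{\theta} \in \arg\max_{\theta}\, \mathrm{CV}(\theta),
\]
where $h_{\theta}^{(-k)}$ is the classifier trained with configuration $\theta$ on all folds except $\mathcal{D}_k$, and $F_1(h;\mathcal{D})$ is the F-1 score of classifier $h$ evaluated on $\mathcal{D}$.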

\begin{table*}[ht]
\centering
\caption{F-1 Score Attained by Classification Models on 7 UCI Datasets.}
\label{tab:real_world}
\smaller
\begin{tabular}{lccccccc}
\toprule
Model    & \multicolumn{1}{c}{Banknote} & \multicolumn{1}{c}{BCW} & \multicolumn{1}{c}{CB} & \multicolumn{1}{c}{MM} & \multicolumn{1}{c}{Parkinson's} & \multicolumn{1}{c}{Rice} & \multicolumn{1}{c}{UKM} \\ \midrule
Central (DR-SVM)  &$\mathbf{.950}\pm .011$&$.964 \pm .013$&$.773 \pm .052$& $.792 \pm .017$&$.904 \pm .025$&$\mathbf{.938} \pm .005$& $.845 \pm .027$  \\ \midrule
FedSGD ($\ell_2$-SVM)  &$\mathbf{.950} \pm .011$&$.914 \pm .019$&$.765 \pm .045$&$.624 \pm .161$&$.752 \pm .204$&$.856 \pm .013$& $.808 \pm .019$\\
FedAvg ($\ell_2$-SVM) &$\mathbf{.950} \pm .011$&$.929 \pm .016$&$.788 \pm .051$&$.787 \pm .024$&$.816 \pm .132$&$.936 \pm .006$& $.847 \pm .027$ \\
FedProx ($\ell_2$-SVM) &$\mathbf{.950} \pm .011$&$.929 \pm .019$&$.780 \pm .075$&$.782 \pm .052$&$.735 \pm .210$&$.931 \pm .006$& $.847 \pm .028$ \\ 
FedDRO (KL) &$.945 \pm .011$& $.925 \pm .019$ & $.738 \pm .036$ &$.783 \pm .019$&$.864 \pm .025$& $.859 \pm .009$&$.718 \pm .001$ \\
\midrule
SM (FDR-SVM)  &$.855 \pm .017$&$.957 \pm .015$&$.769 \pm .054$&$.797 \pm .023$&$\mathbf{.920} \pm .023$&$.936 \pm .006$& $.840 \pm .031$    \\
ADMM (FDR-SVM)  &$\mathbf{.950} \pm .011$&$\mathbf{.967} \pm .014$&$\mathbf{.792} \pm .047$&$\mathbf{.798} \pm .017$&$.911 \pm .021$&$\mathbf{.938} \pm .005$& $\mathbf{.848} \pm .027$ \\
ADMM-SC (FDR-SVM)  &$\mathbf{.950} \pm .011$&$.966 \pm .014$&$.765 \pm .048$&$.797 \pm .019$&$.902 \pm .026$&$\mathbf{.938} \pm .005$& $.846 \pm .027$\\
\bottomrule
\end{tabular}
\end{table*}

\textbf{Results.} Table \ref{tab:real_world} presents the performance achieved by each model. Our proposed models outperform the federated benchmark models on most datasets, in some cases by a substantial margin (e.g., on BCW and Parkinson's). This underscores the value of DR in modeling uncertainty, and the benefits of using algorithms specifically designed for the FDR-SVM problem. We note that at least one of our FDR-SVM algorithms attains the highest F-1 score on every dataset. Additionally, the ADMM algorithm outperforms the SM algorithm on all datasets except Parkinson's, which suggests that ADMM often converges in practice even though theoretical convergence is not guaranteed. 

We also observe that in some cases ADMM-SC performs noticeably worse than ADMM (e.g., on CB and Parkinson's), while in others it closely matches ADMM's performance (e.g., on Banknote, MM, and Rice). This suggests that pursuing guaranteed theoretical convergence comes at the cost of stronger regularization and, thus, potentially weaker performance. Notably, the ADMM or SM algorithms can sometimes outperform the centralized model. This suggests that our proposed MoWB ambiguity set can \textbf{outperform} the classical Wasserstein ball in modeling uncertainty in some settings, as hypothesized in Remark \ref{remark:amb_set}. Finally, we note that \texttt{FedAvg} and \texttt{FedProx} failed to consistently converge on the Parkinson's dataset despite extensive hyperparameter tuning and a diminishing learning rate, as reflected by their large standard deviations in Table \ref{tab:real_world}. This suggests a lack of stability, potentially due to the non-smoothness of the hinge loss, which further highlights the benefits of our algorithms. We show through a one-sided Wilcoxon signed-rank test in Appendix \ref{app:stat_sig} that the performance improvements offered by our model are statistically significant.
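
For readers unfamiliar with the test, the generic one-sided Wilcoxon signed-rank statistic takes the following form. The notation is introduced here for illustration, and we assume for this sketch that pairs are formed per repetition; the specifics of our test are deferred to Appendix \ref{app:stat_sig}. Let $d_r$ be the difference in F-1 score between our model and a baseline on the $r$-th paired repetition, and let $R_r$ be the rank of $|d_r|$ among the nonzero differences. The statistic is
\[
W^{+} = \sum_{r \,:\, d_r > 0} R_r ,
\]
and the null hypothesis that the differences are symmetric about zero is rejected in favor of a positive shift when $W^{+}$ is sufficiently large.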

\subsection{Industrial Data Experiment}\label{sec:sens}
We utilize industrial data from degrading pumps to examine the performance of our models. We explore 5 settings: i) nominal: training data is distributed evenly across clients and classes, ii) client imbalance: training data distribution across clients is $[70\%, 15\%, 10\%, 5\%]$, iii) class imbalance: training data distribution across classes is $[90\%, 10\%]$, iv) client+class imbalance: a combination of the previous two settings, and v) noisy labels: $15\%$ of the training labels are flipped. This experiment contains two distinct components: 1) a sensitivity analysis, and 2) a benchmarking study. Performance is evaluated in terms of mean correct classification rate (mCCR) and F-1 score in the sensitivity analysis and benchmarking study, respectively.

\textbf{Dataset.} We utilize industrial data generated via a physics-driven pump model \citep{Mathworks}. The data contains healthy and leak fault classes. Client heterogeneity is simulated by generating different fault severities per client. We use $G=4$, $P=14$, $N=400$, and $N_{Test}=1000$ test samples. The test set contains 500 healthy samples, and 125 samples from each of the 4 fault severities. 
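
As a quick check on these sizes, the test set contains $500 + 4 \times 125 = 1000$ samples, matching $N_{Test}$. If $N=400$ is read as the total number of training samples across all clients (an assumption made only for this illustration), the settings above correspond to the following counts:
\[
\begin{array}{ll}
\mbox{nominal:} & 400/4 = 100 \mbox{ samples per client},\\
\mbox{client imbalance:} & (0.70,\, 0.15,\, 0.10,\, 0.05) \times 400 = (280,\, 60,\, 40,\, 20),\\
\mbox{class imbalance:} & (0.90,\, 0.10) \times 400 = (360,\, 40),\\
\mbox{noisy labels:} & 0.15 \times 400 = 60 \mbox{ flipped labels}.
\end{array}
\]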

\subsubsection{Sensitivity Analysis}
\textbf{Baseline.} We compare our models to the central DR-SVM benchmark by \cite{2019regularization}.

\textbf{Hyperparameters.} In this part of the experiment, we plot each algorithm's performance as a function of its hyperparameters. We examine global and local hyperparameters, and vary each of them separately. The global hyperparameters are the initial step-size $\gamma$ for the SM algorithm, the scale parameter $\rho$ for the ADMM algorithms, and the total number of rounds $T$ for all algorithms. The local hyperparameters are the label-flipping cost $\kappa_g$ and the local Wasserstein ball radius factor $\beta_g$, where the radius is $\varepsilon_g = \frac{1}{\beta_g N_g}$. This is used as a simplifying heuristic to relate the radius to the number of training samples. We also vary the central model's $\varepsilon$ and $\kappa$, showing only its best performance as a benchmark line on the plots.
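
To make the role of the radius factor concrete, note that $\varepsilon_g$ is inversely proportional to $\beta_g$ for a fixed $N_g$. Scaling the factor by a constant $c>0$ (a symbol introduced here for illustration only) therefore rescales the radius as
\[
\varepsilon_g(c\,\beta_g) = \frac{1}{c}\,\varepsilon_g(\beta_g),
\]
so that, for example, moving from $\beta_g = 1$ to $\beta_g = 10$ shrinks the local Wasserstein ball radius by a factor of ten.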

\begin{figure*}[ht]
    \centering
    \includegraphics[width=1\textwidth]{Figures/sensitivity_global.pdf}
    \caption{mCCR vs. the Global Hyperparameters, Comparing Our Proposed Methods to the Best-Performing Central Model.}
    \label{fig:global_imb}
\end{figure*}

\begin{figure*}[ht]
    \centering
    \includegraphics[width=1\textwidth]{Figures/sensitivity_local.pdf}
    \caption{mCCR vs. the Local Hyperparameters, Comparing Our Proposed Methods to the Best-Performing Central Model.}
    \label{fig:local_imb}
\end{figure*}

\textbf{Results.} The \textit{global hyperparameter} effects are shown in Figure \ref{fig:global_imb}. In most settings, SM attains a higher peak performance than ADMM; however, it can require more communication rounds to do so. This is highlighted in the `class imbalance' and `client + class imbalance' settings. The SM algorithm is also relatively robust to the choice of $\gamma$ and maintains peak performance across a wide range of values. In contrast, the ADMM algorithm is sensitive to $\rho$, with performance rapidly decreasing as $\rho$ increases. This suggests that ADMM may require more involved global hyperparameter tuning in practice, but can achieve its peak performance in fewer communication rounds. As hypothesized in Remark \ref{rem:reg}, we observe that ADMM largely outperforms ADMM-SC, which we attribute to the additional strongly convex regularization terms in the ADMM-SC client objectives. Finally, we also observe that SM and ADMM slightly outperform the best-performing central model in the noisy labels case. This can likely be attributed to our novel ambiguity set's improved uncertainty modeling capability.

The \textit{local hyperparameter} effects are shown in Figure \ref{fig:local_imb}. Generally, model performance improves as the radius of the local Wasserstein balls decreases (i.e., as $\beta_g$ increases). This suggests that performance degrades with larger local Wasserstein balls due to over-conservatism. However, in the noisy labels setting, the performance of the SM model deteriorates as the local radius decreases. This suggests the need for larger local ambiguity sets to adequately capture label uncertainty. We also observe that the SM model is highly sensitive to the local radius and $\kappa_g$ in the noisy labels setting, whereas ADMM achieves its best performance across a broader range of hyperparameter values. This suggests the need for local hyperparameter fine-tuning if SM is used in an application with highly uncertain labels. Moreover, in all other settings, ADMM's performance tends to improve as $\kappa_g$ increases. This is to be expected, since a lower $\kappa_g$ implies greater anticipation of label uncertainty, and thus over-conservatism. As in the global hyperparameter experiments, we again observe the suboptimality of ADMM-SC, which underscores the sacrifice in model performance associated with enforcing guaranteed convergence.

\subsubsection{Benchmarking}
\textbf{Baselines.} We use the same benchmark models as in the UCI data experiment.

\textbf{Hyperparameters.} We tune the same global and local hyperparameters discussed in the sensitivity analysis of the industrial data experiment. Here, however, we use 5-fold cross-validation for hyperparameter tuning, and we tune the global and local hyperparameters simultaneously.

\textbf{Results.} Table \ref{tab:industrial} shows the results of this study, which are averaged over 10 repetitions. As in the UCI data experiment, we observe that one of our methods obtains the best performance among all federated approaches in every setting tested. This underscores the practical impact of our proposed model and its solution algorithms in federated classification problems. Unlike the UCI data experiment, however, the SM algorithm is the top performer among our methods in four of the five settings, with ADMM performing best only under noisy labels. This suggests that algorithm choice should be influenced by the dataset under study, among other factors. Finally, we observe that, for this dataset, ADMM-SC is largely outperformed by ADMM. This again provides an example where opting for theoretically guaranteed convergence may come at a sacrifice in model accuracy due to the additional regularization.

\begin{table*}[ht]
\centering
\caption{F-1 Score Attained by Classification Models on Industrial Dataset in 5 Settings.}
\label{tab:industrial}
\smaller
\begin{tabular}{lccccc}
\toprule
Model    & \multicolumn{1}{c}{Nominal} & \multicolumn{1}{c}{Client Imbalance} & \multicolumn{1}{c}{Class Imbalance} & \multicolumn{1}{c}{Client + Class Imbalance} & \multicolumn{1}{c}{Noisy Labels} \\ \midrule
Central (DR-SVM)  &$.939\pm .004$&$\mathbf{.930} \pm .012$&$\mathbf{.903} \pm .012$& $\mathbf{.901} \pm .017$&$\mathbf{.908} \pm .011$ \\ \midrule
FedSGD ($\ell_2$-SVM)  &$.886 \pm .008$&$.887 \pm .007$&$.675 \pm .014$&$.685 \pm .030$&$.861 \pm .009$\\
FedAvg ($\ell_2$-SVM) &$.923 \pm .006$&$.919 \pm .010$&$.866 \pm .018$&$.845 \pm .059$&$.894 \pm .017$ \\
FedProx ($\ell_2$-SVM) &$.926 \pm .010$&$.919 \pm .011$&$.862 \pm .019$&$.842 \pm .058$&$.894 \pm .019$ \\ 
FedDRO (KL) &$.913 \pm .007$& $.914 \pm .010$ & $.858 \pm .014$ &$.835 \pm .052$&$.883 \pm .012$ \\
\midrule
SM (FDR-SVM)  &$\mathbf{.942} \pm .006$&$\mathbf{.930} \pm .010$&$\mathbf{.883} \pm .022$&$\mathbf{.879} \pm .035$&$.894 \pm .015$    \\
ADMM (FDR-SVM)  &$.918 \pm .010$&$.910 \pm .028$&$.868 \pm .018$&$.855 \pm .025$&$\mathbf{.903} \pm .011$ \\
ADMM-SC (FDR-SVM)  &$.817 \pm .009$&$.819 \pm .011$&$.638 \pm .020$&$.627 \pm .028$&$.806 \pm .013$\\
\bottomrule
\end{tabular}
\end{table*}