\section{Further Experimental Details and Supplementary Results}\label{app:exp}
In this section we provide the details of all experiments presented in this paper, as well as the results of a \textbf{scalability experiment}. The code and instructions associated with all experiments are available at \href{https://github.com/mibrahim41/FDR-SVM}{this link}.
\subsection{Software and Hardware Details}
All experiments presented in this work were executed on Intel Xeon Gold 6226 CPUs @ 2.7 GHz (using 4 cores) with 120 GB of DDR4-2993 MHz DRAM. Table~\ref{tab:software} lists the software used in the paper.

\begin{table}[ht]
\centering
\caption{Details on All the Software Used in the Numerical Experiments.}
\label{tab:software}
\begin{tabular}{lll}
\toprule
Software           & Version & License             \\
\midrule
Gurobi       & 10.0.1  & Academic license    \\
MATLAB       & R2021b  & Academic license    \\
Python       & 3.10.9  & Open source license \\
scikit-learn & 1.2.1   & Open source license \\
NumPy        & 1.23.5  & Open source license \\
SciPy        & 1.10.0  & Open source license \\
UCIMLRepo     & 0.0.3   & Open source license\\
\bottomrule
\end{tabular}
\end{table}

\subsection{Datasets Utilized}
\subsubsection{UCI Data Experiment}
We provide details on the datasets used in the experiment described in Section~\ref{sec:real}. Note that the Parkinson's dataset exhibits a high level of class imbalance ($75\%$ of the samples belong to one class and $25\%$ to the other), which suggests that the SM algorithm is more successful with data that exhibits such levels of imbalance. Moreover, the ``Very Low'' and ``Low'' classes in the UKM dataset were combined into one class, whereas ``Middle'' and ``High'' were combined into another.
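
The label merging for UKM can be sketched as follows. This is a minimal illustration and not necessarily the preprocessing code used for the reported results: the label strings, the function name \texttt{merge\_ukm\_labels}, and the choice of which merged class is mapped to $+1$ are all assumptions.

```python
# Hypothetical label strings; the raw UKM files may spell them differently.
# Which merged class receives +1 is also an assumption made for illustration.
UKM_TO_BINARY = {
    "Very Low": -1,
    "Low": -1,
    "Middle": +1,
    "High": +1,
}


def merge_ukm_labels(labels):
    """Map the four UKM knowledge levels to two classes (-1 / +1)."""
    return [UKM_TO_BINARY[label] for label in labels]
```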

\begin{table}[ht]
\centering
\caption{Details on Datasets Utilized for UCI Experiments.}
\label{tab:data_real}
\begin{tabular}{lll}
\toprule
Dataset                              & Abbreviation & License   \\
\midrule
Banknote Authentication \citep{banknote}             & Banknote     & CC BY 4.0 \\
Breast Cancer Wisconsin (Diagnostic) \citep{bcw}& BCW          & CC BY 4.0 \\
Connectionist Bench (Sonar) \citep{cb}         & CB           & CC BY 4.0 \\
Mammographic Mass    \citep{mm}                & MM           & CC BY 4.0 \\
Parkinson's     \citep{parkinsons}                     & Parkinson's  & CC BY 4.0 \\
Rice (Cammeo and Osmancik)   \citep{rice}        & Rice         & CC BY 4.0 \\
User Knowledge Modeling   \citep{ukm}           & UKM          & CC BY 4.0\\
\bottomrule
\end{tabular}
\end{table}

\subsubsection{Industrial Data Experiment}
The data used in the experiment described in Section~\ref{sec:sens} is generated by a physics-driven Simulink model that simulates the healthy and faulty operation of a reciprocating pump \citep{Mathworks}. The generated data belongs to two classes: healthy pump and leak fault. We focus on binary classification since binary classification models extend directly to multiclass problems via a one-vs-all framework, as mentioned previously; performance in the binary setting is therefore indicative of that in the multiclass setting. To simulate data heterogeneity across clients, the data is generated with a different severity of the leak fault for each client. Leak fault severity is controlled via the \texttt{leak\_area\_set\_factor} variable in the MATLAB script, and the four values used in our experiments are $\{1\times 10^{-3}, 4\times 10^{-3}, 7\times 10^{-3}, 1\times 10^{-2}\}$. Features extracted from the generated time series data (such as kurtosis and skewness) are used for classification.
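
As an illustration of the feature extraction step, the sketch below computes the skewness and kurtosis of a single time series from its central moments. It is written in Python for brevity (the simulation itself runs in MATLAB), it uses the biased moment-based definitions, and the function name \texttt{skewness\_and\_kurtosis} is hypothetical; the exact estimators and the full feature set used in the experiments may differ.

```python
def skewness_and_kurtosis(signal):
    """Return (skewness, kurtosis) from biased central-moment estimates.

    Kurtosis is the non-excess version, so a Gaussian signal gives roughly 3.
    """
    n = len(signal)
    mean = sum(signal) / n
    centered = [x - mean for x in signal]
    # Second, third, and fourth central moments (divided by n, not n - 1).
    m2 = sum(c ** 2 for c in centered) / n
    m3 = sum(c ** 3 for c in centered) / n
    m4 = sum(c ** 4 for c in centered) / n
    if m2 == 0:
        raise ValueError("constant signal: skewness and kurtosis are undefined")
    return m3 / m2 ** 1.5, m4 / m2 ** 2
```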

\subsection{Hyperparameter Details}
In all of our implementations of the SM algorithm we utilize a step-size that diminishes according to $\gamma(t) = \frac{\gamma}{t}$, where we treat $\gamma$ as a model hyperparameter. This step-size obeys the conditions required for algorithm convergence stated in Theorem \ref{thm:sm_conv}. Next, we provide details on the hyperparameter values used in the UCI Data Experiment and the Industrial Data Experiment in Sections \ref{sec:real} and \ref{sec:sens}, respectively.
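
As a brief remark on the step-size schedule above, for any $\gamma > 0$ the step sizes vanish, are not summable, and are square-summable:
\[
\lim_{t\to\infty} \frac{\gamma}{t} = 0, \qquad \sum_{t=1}^{\infty} \frac{\gamma}{t} = \infty, \qquad \sum_{t=1}^{\infty} \frac{\gamma^2}{t^2} = \frac{\pi^2 \gamma^2}{6} < \infty .
\]
These are the properties typically required of diminishing step-size rules. They are listed here only for illustration; Theorem~\ref{thm:sm_conv} should be consulted for the precise requirements.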

\textbf{UCI Data Experiment.} For the centralized baseline, we tune $\varepsilon \in \{ 1 \times 10^b \}_{b=-5}^{-1}$ and $\kappa \in \{ 0.1,0.25,0.5,0.75,1\}$. For the federated baselines we use a diminishing step-size of $\gamma(t) = \frac{\gamma(0)}{t}$, where $\gamma(0)$ is treated as a tuning hyperparameter taking values $\gamma(0) \in \{10^{-3},10^{-2},10^{-1},10^{0}\}$, and a local regularization penalty of $\frac{1}{10N_g}$ at each client. For \texttt{FedAvg} and \texttt{FedProx}, we use a local batch size of $20\%$ of the available training data and $E=5$ local SGD epochs where appropriate. We also use $\mu = 1$ for \texttt{FedProx}. For our proposed methods we fix $\kappa_g = 1$ and $\varepsilon_g = \frac{1}{10N_g}$, and we tune $\rho \in \{10^{-3},10^{-2},10^{-1},10^{0}\}$ and $\gamma \in \{10^{0},10^{1},10^{2},10^{3}\}$. For all federated methods (including baselines and ours) we use $G=4$ with equal client weights. Finally, for ADMM, ADMM-SC, and the federated baselines, we tune $T \in \{5,10,20,60,100,140,180,220\}$, whereas for SM we tune $T \in \{100,140,180,220\}$. All tuning is done via 5-fold cross-validation.
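
The search spaces above can be enumerated as in the following sketch. It assumes an exhaustive (Cartesian-product) search over the listed values, which the text does not state explicitly. The helper \texttt{grid} and all variable names are hypothetical, and the pairing of $\rho$ with the ADMM-based methods and $\gamma$ with SM is taken from the description of the sensitivity analysis.

```python
from itertools import product

# Candidate values copied from the text. The exhaustive search is an assumption.
T_ALL = [5, 10, 20, 60, 100, 140, 180, 220]   # ADMM, ADMM-SC, federated baselines
T_SM = [100, 140, 180, 220]                   # SM only
RHO = [1e-3, 1e-2, 1e-1, 1e0]                 # assumed to apply to ADMM / ADMM-SC
GAMMA = [1e0, 1e1, 1e2, 1e3]                  # assumed to apply to SM
EPS = [10.0 ** b for b in range(-5, 0)]       # centralized baseline
KAPPA = [0.1, 0.25, 0.5, 0.75, 1.0]           # centralized baseline


def grid(**axes):
    """Return every combination of the given axes as a list of dicts."""
    names = list(axes)
    return [dict(zip(names, combo)) for combo in product(*axes.values())]


admm_grid = grid(rho=RHO, T=T_ALL)
sm_grid = grid(gamma=GAMMA, T=T_SM)
central_grid = grid(eps=EPS, kappa=KAPPA)
```

Under this assumption, each ADMM-based method has $32$ candidate configurations, SM has $16$, and the centralized baseline has $25$, each of which would be scored by 5-fold cross-validation.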

\textbf{Industrial Data Experiment - Sensitivity Analysis.} In the global hyperparameters experiment we evaluate the performance of our proposed federated algorithms for $T \in \{5,10,20,60,100,140,180,220\}$ and $\rho,\gamma \in \{10^{-3},10^{-2},10^{-1},10^{0},10^{1},10^{2},10^{3}\}$. We fix $\varepsilon_g = \frac{1}{10N_g}$ and $\kappa_g = 0.5$ for each client $g$. While such values of $\varepsilon_g$ and $\kappa_g$ may not be optimal, we use them to demonstrate that our proposed model can perform well when compared to the centralized baseline.

In the local hyperparameters experiment, we evaluate the performance of both of our federated algorithms for $\varepsilon_g = \frac{1}{\beta N_g}$ where $\beta \in \{ 0.1,1,10,100 \}$ and for $\kappa_g = \kappa \in \{ 0.1,0.25,0.5,0.75,1 \}$. We fix $T = 220$ and $\gamma = 1\times 10^2$ for the SM algorithm, and $T = 100$ and $\rho = 1\times 10^{-3}$ for the ADMM algorithm. In all settings we evaluate the performance of the centralized baseline for $\varepsilon \in \{ 1 \times 10^b \}_{b=-5}^{-1}$ and $\kappa \in \{ 0.1,0.25,0.5,0.75,1\}$, and we only report the peak performance achieved.

In all settings we use $\tau_g=18\rho$ for the ADMM-SC algorithm, which is the minimum value $\tau_g$ can take while maintaining guaranteed convergence, as shown in Theorem~\ref{thm:admm_conv_cri}. We choose the minimum because increasing $\tau_g$ increases the strength of the redundant regularization, thereby affecting performance.

\textbf{Industrial Data Experiment - Benchmarking.} In this portion we use 5-fold cross-validation to tune the same hyperparameters discussed in the sensitivity analysis above. Namely, we fix $T=220$, and we tune $\rho \in \{10^{-3},10^{-2},10^{-1}\}$ or $\gamma \in \{10^{1},10^{2},10^{3}\}$ (depending on the algorithm), $\kappa \in \{ 0.1, 0.5, 1 \}$, and $\beta \in \{ 10, 100\}$ for all our methods.

\textbf{Model Parameter Initialization.} In all of our experiments, we use initial model parameters $\boldsymbol{w}^{(0)} = \boldsymbol{0}$ (i.e., a vector of zeros), and initial scaled Lagrange multipliers $\boldsymbol{\mu}_g^{(0)} = \boldsymbol{1}$ (i.e., a vector of ones).

\subsection{UCI Data Experiment Statistical Significance} \label{app:stat_sig}
To evaluate the statistical significance of the results presented in Table~\ref{tab:real_world}, we perform a one-sided Wilcoxon signed-rank test. The test compares the performance of the best-performing version of our model to that attained by each of the benchmarks in a pairwise fashion. The null $H_0$ and alternative $H_1$ hypotheses of this test are defined next.
\begin{itemize}
    \item $H_0$: The distribution of the differences in performance between our model and each benchmark has median zero. That is, there is no systematic increase or decrease between the pairs.
    \item $H_1$:  The median of the differences is greater than $0$. That is, our approach is statistically better.
\end{itemize}
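
The test itself can be sketched in a few lines. The following is a minimal, self-contained illustration and not necessarily the routine used to produce Table~\ref{tab:wilc}: it assumes the inputs are paired performance values for our model and one benchmark, it drops zero differences, and it computes an exact $p$-value that is only approximate when tied differences occur. The function name is hypothetical. A library routine such as \texttt{scipy.stats.wilcoxon} with \texttt{alternative='greater'} offers the same one-sided test.

```python
def wilcoxon_signed_rank_greater(ours, baseline):
    """One-sided Wilcoxon signed-rank test for H1: median(ours - baseline) > 0.

    Returns (W_plus, p_value). Zero differences are dropped, ties receive
    average ranks, and the null distribution is computed as if there were no ties.
    """
    diffs = [a - b for a, b in zip(ours, baseline) if a != b]
    n = len(diffs)
    if n == 0:
        raise ValueError("all paired differences are zero")

    # Rank the absolute differences, averaging ranks within groups of ties.
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    start = 0
    while start < n:
        end = start
        while end + 1 < n and abs(diffs[order[end + 1]]) == abs(diffs[order[start]]):
            end += 1
        average_rank = (start + end) / 2 + 1
        for pos in range(start, end + 1):
            ranks[order[pos]] = average_rank
        start = end + 1

    # Test statistic: sum of the ranks attached to positive differences.
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)

    # Exact null distribution: under H0 each rank 1..n is positive with
    # probability 1/2, so count the subsets of {1..n} with each possible sum.
    max_w = n * (n + 1) // 2
    counts = [1] + [0] * max_w
    for r in range(1, n + 1):
        for s in range(max_w, r - 1, -1):
            counts[s] += counts[s - r]
    p_value = sum(c for s, c in enumerate(counts) if s >= w_plus) / 2 ** n
    return w_plus, p_value
```

For example, for the paired differences $(5, 2, -1, 8, 6)$ this sketch gives $W^+ = 14$ and $p = 2/32 = 0.0625$, so $H_0$ would not be rejected at $\alpha = 0.05$.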

The results of this test are presented in Table~\ref{tab:wilc}, using a significance level of $\alpha = 0.05$. The table indicates whether the null hypothesis $H_0$ is rejected or not. We observe that the performance improvement offered by our model is statistically significant for most datasets and most benchmark models, since the null hypothesis $H_0$ is rejected (``Reject'') in most settings. This underscores the practical impact and performance improvements offered by our proposed model.

\begin{table*}[ht]
\centering
\caption{Results of One-Sided Wilcoxon Signed-Rank Test Performed on Results of Benchmarking Experiments on 7 UCI Datasets.}
\label{tab:wilc}
\smaller
\begin{tabular}{lccccccc}
\toprule
Model    & \multicolumn{1}{c}{Banknote} & \multicolumn{1}{c}{BCW} & \multicolumn{1}{c}{CB} & \multicolumn{1}{c}{MM} & \multicolumn{1}{c}{Parkinson's} & \multicolumn{1}{c}{Rice} & \multicolumn{1}{c}{UKM} \\ \midrule
FedSGD ($\ell_2$-SVM)  &Fail to reject&Reject&Reject&Reject&Reject&Reject& Reject\\
FedAvg ($\ell_2$-SVM) &Fail to reject&Reject&Fail to reject&Reject&Reject&Fail to reject& Fail to reject \\
FedProx ($\ell_2$-SVM) &Fail to reject&Reject&Fail to reject&Reject&Reject&Reject& Fail to reject \\
FedDRO (KL) &Reject& Reject & Reject &Reject&Reject& Reject&Reject \\
\bottomrule
\end{tabular}
\end{table*}