\section{Introduction}\label{sec:intro}
Original equipment manufacturers (OEMs) of industrial equipment used in manufacturing, energy, and healthcare sell their products to multiple customers under lucrative service contracts that demand stringent reliability standards, a requirement that is especially critical for systems such as aircraft engines or power-plant turbines. To meet these standards, OEMs must provide accurate diagnostics of impending faults, including their types and severities \citep{dutta2023,yang2025,lei2020}. However, most customers are unwilling to share their operational data due to confidentiality and security concerns, particularly in industries critical to national security (e.g., nuclear power plants). Consequently, OEMs face the challenge of leveraging dispersed data sources to develop analytic models capable of improving fault detection and classification.

Distributed fault diagnosis \citep{du2024} via federated learning (FL) \citep{mcmahan2017} offers a promising solution to this problem. FL enables the training of a global model on data that remains at each client, thereby preserving privacy while allowing OEMs to draw on broader information sources for more robust fault diagnosis. Despite these advantages, industrial data can be extremely noisy, owing to harsh operating conditions and sensor limitations, and often suffers from labeling errors caused by variations in operator expertise. As a result, industrial datasets tend to be among the \textit{most uncertain and poorly labeled}.

Uncertainty in a binary classification problem can be modeled by considering the features $\boldsymbol{x} \in \mathcal{X} \subseteq \mathbb{R}^P$ and labels $y \in \{ -1, +1 \}$ as random variables governed by an underlying distribution $\mathbb{P}$ \citep{shafieezadehabadeh2015distributionally}. An ideal classifier with parameters $\boldsymbol{w} \in \mathbb{R}^P$ and a loss function $\ell(\boldsymbol{w};(\boldsymbol{x},y))$ minimizes the expected risk $\mathbb{E}^{\mathbb{P}}[\ell(\boldsymbol{w};(\boldsymbol{x},y))]$. Since $\mathbb{P}$ is typically unknown, it is common practice to rely on an empirical distribution $\widehat{\mathbb{P}}_N$ derived from $N$ independent and identically distributed (IID) samples and then minimize the empirical risk $\mathbb{E}^{\widehat{\mathbb{P}}_N}[\ell(\boldsymbol{w};(\boldsymbol{x},y))]$. However, when the training data is noisy or limited, the resulting model can be highly suboptimal, leading to poor out-of-sample performance \citep{kuhn2019wasserstein,shafieezadehabadeh2015distributionally}.

Distributionally robust optimization (DRO) \citep{scarf1957min,delage2010,bayraksan2015,shapiro2017,datadrivenDRO,2019regularization,kuhn2019wasserstein} addresses these challenges by specifying an ambiguity set $\mathcal{A}$ of plausible data distributions. The model is trained by minimizing the worst-case risk 
$\sup_{\mathbb{Q} \in \mathcal{A}}\mathbb{E}^{\mathbb{Q}}[\ell(\boldsymbol{w};(\boldsymbol{x},y))]$ taken over all distributions $\mathbb{Q} \in \mathcal{A}$. A particularly popular approach is to define $\mathcal{A}$ as a Wasserstein ball around $\widehat{\mathbb{P}}_N$ \citep{kuhn2019wasserstein}. Wasserstein-based DRO (WDRO) has attracted growing attention in machine learning \citep{2019regularization,nieter2023,rui2023}. However, most WDRO research remains limited to centralized settings, and extending it to federated environments introduces significant challenges and computational complexities \citep{cher2020}.
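
For concreteness, a Wasserstein ball and the resulting training problem are commonly written as follows. This is only a sketch in standard notation: the radius $\varepsilon \geq 0$ and the Wasserstein distance $W(\cdot,\cdot)$ are symbols assumed here for illustration and may differ from the notation adopted later in the paper.
\[
\mathcal{A} = \left\{ \mathbb{Q} \,:\, W\big(\mathbb{Q}, \widehat{\mathbb{P}}_N\big) \leq \varepsilon \right\},
\qquad
\min_{\boldsymbol{w} \in \mathbb{R}^P} \; \sup_{\mathbb{Q} \in \mathcal{A}} \mathbb{E}^{\mathbb{Q}}[\ell(\boldsymbol{w};(\boldsymbol{x},y))].
\]
Under this form, and provided $W$ is a metric on distributions, setting $\varepsilon = 0$ collapses $\mathcal{A}$ to $\{\widehat{\mathbb{P}}_N\}$ and recovers empirical risk minimization, whereas a larger $\varepsilon$ hedges against distributions farther from the empirical one.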

\textbf{Contributions.} We develop a \textit{federated distributionally robust support vector machine} (FDR-SVM) that can be trained to global optimality on data distributed across $G$ clients. Using DRO allows our model to be robust to uncertainties in both features and labels. More importantly, our model does not rely on restrictive assumptions, such as Lipschitz smoothness or strong convexity, which are often imposed by existing FL approaches. Although differential privacy is vital in FL, our work focuses on robustness to distributional uncertainties. To the best of our knowledge, this is the first effort to use WDRO to robustify an FL model under such general conditions. The main contributions of this paper are:
\begin{enumerate}
    \item We propose a \textbf{M}ixture \textbf{o}f \textbf{W}asserstein \textbf{B}alls (MoWB) ambiguity set that generalizes the Wasserstein ball to the distributed setting. This lays the foundation for robustifying a variety of FL models under the DRO paradigm. We then prove that the true data distribution belongs to MoWB with a certain confidence level under a mild assumption, and we use it to derive our separable FDR-SVM formulation.

    \item We propose a subgradient method-based (SM) algorithm for training our FDR-SVM, where we rigorously derive the subgradient of the \textit{infinite-dimensional} worst-case risk problem at each client assuming the compactness of the feature support set. We then prove the convergence of this algorithm to global optimality, and derive its worst-case time complexity.

    \item We also propose an alternating direction method of multipliers-based (ADMM) algorithm for training our FDR-SVM, where we derive a convex, tractable optimization problem for the local model updates and a closed-form expression for the global model update. We show that this algorithm is guaranteed to converge only when a strongly convex term is added to each client's objective. While this may affect final model performance, convergence is achieved in fewer rounds than SM and without the need for feature support assumptions.

    \item We evaluate our proposed methods on an industrial dataset and various popular UCI repository datasets, where we study their hyperparameter sensitivity and demonstrate that the FDR-SVM typically outperforms state-of-the-art (SOTA) baselines.
\end{enumerate}