\section{Background and Prior Work}\label{sec:rel_works}
\textbf{Distributionally Robust Optimization.} DRO has gained popularity recently due to its applications in various areas of optimization and ML \citep{kuhn2019wasserstein}. The general 1-WDRO problem is mathematically formulated as
\begin{equation} \label{eq:dro_problem}
    \inf_{\boldsymbol{w} \in \mathcal{W}} \sup_{\mathbb{Q} \in \mathcal{A}_{\varepsilon,1,d}(\Xi)} \mathbb{E}^{\mathbb{Q}}[\ell],
\end{equation}
where $\ell$ is the loss function parameterized by $\boldsymbol{w} \in \mathcal{W}$, and $\mathbb{Q}$ is any distribution within the ambiguity set $\mathcal{A}_{\varepsilon,1,d}(\Xi)$, defined as
\begin{equation}\label{eq:trad_amb_set}
    \mathcal{A}_{\varepsilon,1,d}(\Xi) \coloneqq \left \{ \mathbb{Q} \in \mathcal{P} \left ( \Xi \right ) \colon W_{d,1} \left ( \mathbb{Q},\widehat{\mathbb{P}}_N\right ) \leq \varepsilon \right \},
\end{equation}
where $\mathcal{P}(\Xi)$ is the set of all distributions supported on $\Xi$, and $W_{d,1}(\cdot,\cdot)$ is the type-$1$ Wasserstein distance equipped with the transportation cost function $d(\boldsymbol{\xi},\boldsymbol{\xi}^{\prime})$. Several works \citep{datadrivenDRO,kuhn2019wasserstein,shapiro2017,rui2023} show that the DRO problem \eqref{eq:dro_problem} admits tractable convex reformulations in many cases of practical interest. Moreover, \cite{kuhn2019wasserstein} demonstrate that the Wasserstein ambiguity set enjoys several attractive properties, such as its ability to assign point mass anywhere in the support set and its interpretation as a confidence set for $\mathbb{P}$. We provide further background on Wasserstein DRO in Appendix \ref{app:bg}.
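
For reference, a standard way to write the type-$1$ Wasserstein distance appearing in \eqref{eq:trad_amb_set} is the optimal-transport form below. The symbol $\Pi(\mathbb{Q},\mathbb{Q}^{\prime})$, denoting the set of joint distributions on $\Xi \times \Xi$ with marginals $\mathbb{Q}$ and $\mathbb{Q}^{\prime}$, is notation assumed here for illustration and may differ from that used in Appendix \ref{app:bg}:
\begin{equation*}
    W_{d,1}\left(\mathbb{Q},\mathbb{Q}^{\prime}\right) \coloneqq \inf_{\pi \in \Pi(\mathbb{Q},\mathbb{Q}^{\prime})} \int_{\Xi \times \Xi} d(\boldsymbol{\xi},\boldsymbol{\xi}^{\prime}) \, \pi(\mathrm{d}\boldsymbol{\xi},\mathrm{d}\boldsymbol{\xi}^{\prime}).
\end{equation*}
As a simple illustration of why \eqref{eq:dro_problem} is often tractable, suppose (as an assumption made only for this example) that for fixed $\boldsymbol{w}$ the loss satisfies $\ell(\boldsymbol{\xi}) - \ell(\boldsymbol{\xi}^{\prime}) \leq L \, d(\boldsymbol{\xi},\boldsymbol{\xi}^{\prime})$ for some constant $L \geq 0$ and all $\boldsymbol{\xi},\boldsymbol{\xi}^{\prime} \in \Xi$. Integrating this inequality against any coupling of $\mathbb{Q}$ and $\widehat{\mathbb{P}}_N$ and taking the infimum over couplings gives
\begin{equation*}
    \sup_{\mathbb{Q} \in \mathcal{A}_{\varepsilon,1,d}(\Xi)} \mathbb{E}^{\mathbb{Q}}[\ell] \leq \mathbb{E}^{\widehat{\mathbb{P}}_N}[\ell] + \varepsilon L,
\end{equation*}
so that, under this assumption, the worst-case risk is bounded by the empirical risk plus a Lipschitz penalty scaled by the radius $\varepsilon$.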

Many efforts have successfully utilized DRO to robustify various classifiers. For example, \cite{2019regularization} develop Wasserstein DR logistic regression (LR) and SVM. Further, \cite{selvi2022} extend the DR LR model to data with mixed features. \cite{FACCINI2022} utilize a moment-based ambiguity set to derive a DR version of the SVM. Finally, \cite{sagawa2020} utilize group DRO to mitigate the tendency of deep neural network (DNN) classifiers to learn spurious correlations, an approach that relies on manually grouping the training data. This is advanced by \cite{wu2023}, who utilize a DNN to perform the data grouping. All these efforts implicitly assume that the training data are available at a central location, making them difficult to extend to FL settings \citep{cher2020}.

\textbf{Federated Learning.} Since its introduction by \cite{konecny2016e,mcmahan2017}, FL has garnered much attention due to its practical utility. The \texttt{FedAvg} algorithm introduced by \cite{mcmahan2017} relies on local stochastic gradient descent (SGD) updates by the clients, followed by aggregation and rebroadcasting of the model by the server. The work also introduces \texttt{FedSGD}, where each client performs only one local update step. \texttt{FedProx} \citep{li2020} adds a proximal term to the objective function to mitigate client heterogeneity issues. \cite{wang2020b} develop \texttt{FedNova}, where client updates are normalized to address data heterogeneity without impacting convergence. Alternatively, \cite{karimireddy20a} propose \texttt{SCAFFOLD}, where client drift is addressed through the introduction of control variables. Personalized FL algorithms include \texttt{FedPer} \citep{arivazhagan2019}, which introduces a local personalization layer at each client; \texttt{FedEM} \citep{marfoq2021}, which models local data distributions as a mixture of unknown distributions; \texttt{FedPer++} \citep{xu2022}, which utilizes regularization to prevent local overfitting; and \texttt{FedL2P} \citep{lee2023}, which uses meta-learning to learn a personalization strategy for each client.
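
For concreteness, one communication round of \texttt{FedAvg} as described above can be sketched as follows. The notation is introduced here purely for illustration and is not taken from the remainder of the paper: $\mathcal{S}_t$ denotes the clients participating in round $t$, $n_i$ the number of samples held by client $i$, $\eta$ the step size, $K$ the number of local steps, and $\boldsymbol{g}_i(\cdot)$ a stochastic gradient of client $i$'s local objective. Starting from the broadcast model $\boldsymbol{w}_i^{t,0} = \boldsymbol{w}^{t}$, each client $i \in \mathcal{S}_t$ performs
\begin{equation*}
    \boldsymbol{w}_i^{t,k+1} = \boldsymbol{w}_i^{t,k} - \eta \, \boldsymbol{g}_i\left(\boldsymbol{w}_i^{t,k}\right), \qquad k = 0,\dots,K-1,
\end{equation*}
after which the server aggregates the local models as
\begin{equation*}
    \boldsymbol{w}^{t+1} = \sum_{i \in \mathcal{S}_t} \frac{n_i}{\sum_{j \in \mathcal{S}_t} n_j} \, \boldsymbol{w}_i^{t,K}
\end{equation*}
and rebroadcasts $\boldsymbol{w}^{t+1}$ to the clients. In this sketch, setting $K = 1$ recovers the single-local-step regime described above for \texttt{FedSGD}.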

\textbf{Distributionally Robust Federated Learning.} Recently, many efforts have combined ideas from DRO and FL. For example, \cite{deng2020} develop a DR version of \texttt{FedAvg}, hedging against uncertainty in client weights. \cite{wu2022} propose mixup techniques in the local training stages, addressing noisy and heterogeneous client data. Further, \cite{zecchin2023} develop an efficient algorithm for DR \texttt{FedAvg} with no central server. Alternatively, \cite{Huang2021CompositionalFL} combine FL with stochastic compositional optimization (CO), transforming the DR \texttt{FedAvg} algorithm into a CO problem. A \texttt{FedDRO} algorithm is proposed by \cite{Khanduri2023} as an extension of \texttt{FedAvg} for CO problems. \cite{lau2022} construct a Wasserstein ambiguity set from distributed data using barycenters, which may not exist and, even when they do, can be difficult to compute in a distributed fashion. Moreover, \cite{cher2020} and \cite{le2024} propose distributed WDRO formulations. However, the former relies on peer-to-peer communication, while the latter assumes Lipschitz smoothness and strong convexity of the loss function.