\section{Additional Background on Wasserstein DRO} \label{app:bg}
Distributionally robust optimization (DRO) has recently been popularized as an intermediate approach between stochastic programming (SP) \citep{shapiro_stochprog} and robust optimization (RO) \citep{bental_ro}. It can be viewed as a stochastic programming problem in which the true distribution $\mathbb{P}$ governing the data is unknown. Alternatively, it can be seen as a robust optimization problem that models worst-case perturbations of the data distribution rather than of individual data points. This makes DRO attractive: it models uncertainty without requiring knowledge of the true distribution $\mathbb{P}$ (as in SP) and without being potentially overly conservative (as in RO) \citep{bertsimas2004}.

DRO defines an ambiguity set $\mathcal{A}$ of distributions and then minimizes the worst-case risk attained by any distribution $\mathbb{Q}$ within $\mathcal{A}$. The literature offers several ways of defining the ambiguity set. These include moment-based methods \citep{delage2010}, which define the set through moment properties of the distributions, and distance-based methods \citep{bayraksan2015,kuhn2019wasserstein}, which define the set as a ball centered at some reference distribution, with the radius measured by some distance or divergence between distributions. Commonly used measures include $\phi$-divergences (such as the KL divergence) \citep{bayraksan2015} and the Wasserstein distance \citep{kuhn2019wasserstein}. In most machine learning problems, the reference distribution is taken to be the empirical distribution $\widehat{\mathbb{P}}_N$ of the $N$ training samples.
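
In generic notation, the distance-based DRO problem described above can be sketched as follows. The symbols $\boldsymbol{\beta}$ (decision variables or model parameters), $\ell$ (loss function), $\boldsymbol{\xi}$ (a data sample), $D$ (a distance or divergence between distributions), and $\varepsilon \geq 0$ (the radius) are placeholders introduced only for this illustration and need not match the notation of the main text.
\begin{equation*}
    \min_{\boldsymbol{\beta}} \; \sup_{\mathbb{Q} \in \mathcal{A}} \; \mathbb{E}_{\mathbb{Q}} \left[ \ell(\boldsymbol{\beta}; \boldsymbol{\xi}) \right], \qquad \mathcal{A} \coloneqq \left\{ \mathbb{Q} : D(\mathbb{Q}, \widehat{\mathbb{P}}_N) \leq \varepsilon \right\}.
\end{equation*}
If $D$ vanishes only when its two arguments coincide, then $\varepsilon = 0$ leaves only $\widehat{\mathbb{P}}_N$ in $\mathcal{A}$ and the problem reduces to empirical risk minimization, while larger values of $\varepsilon$ admit more distributions and make the solution more conservative.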

In our work, we focus on ambiguity sets defined via the type-1 Wasserstein distance, because Wasserstein DRO offers several advantages over its counterparts, as demonstrated by \cite{kuhn2019wasserstein}. For example, the Wasserstein ambiguity set can contain both discrete and continuous distributions regardless of the structure of the empirical distribution, which cannot be achieved by the KL divergence ambiguity set. Moreover, with a Wasserstein ambiguity set one can derive out-of-sample performance guarantees using concentration inequalities, which cannot be achieved in moment-based approaches.

The type-1 Wasserstein distance $W_{d,1}$ \citep{kant1958} is commonly referred to as the optimal transport metric or earth mover's distance, because it can be interpreted as the minimum cost of transforming a distribution $\mathbb{Q}$ into $\mathbb{Q}^{\prime}$. It relies on a transportation cost function $d(\boldsymbol{\xi},\boldsymbol{\xi}^{\prime})$ that specifies the cost of moving a unit of mass from point $\boldsymbol{\xi}$ to point $\boldsymbol{\xi}^{\prime}$. The type-1 Wasserstein distance is defined as follows.
\begin{equation*}
    W_{d,1}(\mathbb{Q},\mathbb{Q}') \coloneqq \inf_{\pi \in \Pi(\mathbb{Q},\mathbb{Q}')} \int_{\Xi \times \Xi} d \left( \boldsymbol{\xi}, \boldsymbol{\xi}^{\prime} \right ) \pi(\text{d} \boldsymbol{\xi}, \text{d} \boldsymbol{\xi}^{\prime}),
\end{equation*}
where $d(\boldsymbol{\xi},\boldsymbol{\xi}^{\prime})$ denotes the transportation cost function, and $\Pi(\mathbb{Q},\mathbb{Q}')$ is the set of all joint distributions of $\boldsymbol{\xi}$ and $\boldsymbol{\xi}^{\prime}$ with marginals $\mathbb{Q}$ and $\mathbb{Q}^{\prime}$, respectively. Note that the data in our classification problem consist of continuous features $\boldsymbol{x} \in \mathcal{X} \subseteq \mathbb{R}^P$ and categorical labels $y \in \{ -1,+1 \}$, so that each sample is $\boldsymbol{\xi} = (\boldsymbol{x}, y)$. A commonly used transportation cost function for this setting is
\begin{equation*}
    d(\boldsymbol{\xi},\boldsymbol{\xi}^{\prime}) \coloneqq \|\boldsymbol{x} - \boldsymbol{x}^{\prime}\| + \kappa \mathbbm{1}_{\{y \neq y^{\prime}\}},
\end{equation*}
where $\|\cdot\|$ is any norm on $\mathbb{R}^P$, and $\kappa$ is the label-flipping cost, treated as a user-defined hyperparameter. This cost function quantifies differences in both the features and the labels of two samples.
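
To make these definitions concrete, consider a small illustration; the specific points, the choice of the Euclidean norm, and the value $\kappa = 2$ are assumptions made only for this example. Take $P = 2$, $\boldsymbol{\xi} = (\boldsymbol{x}, y)$ with $\boldsymbol{x} = (1,2)^\top$ and $y = +1$, and $\boldsymbol{\xi}^{\prime} = (\boldsymbol{x}^{\prime}, y^{\prime})$ with $\boldsymbol{x}^{\prime} = (4,6)^\top$ and $y^{\prime} = -1$. Then
\begin{equation*}
    d(\boldsymbol{\xi},\boldsymbol{\xi}^{\prime}) = \|\boldsymbol{x} - \boldsymbol{x}^{\prime}\|_2 + \kappa \mathbbm{1}_{\{y \neq y^{\prime}\}} = \sqrt{(1-4)^2 + (2-6)^2} + 2 \cdot 1 = 5 + 2 = 7,
\end{equation*}
whereas the cost would be $5$ had the two labels agreed. Moreover, if $\mathbb{Q} = \delta_{\boldsymbol{\xi}}$ and $\mathbb{Q}^{\prime} = \delta_{\boldsymbol{\xi}^{\prime}}$ are the Dirac distributions at these two points, the only joint distribution with these marginals is the point mass at $(\boldsymbol{\xi},\boldsymbol{\xi}^{\prime})$, so the infimum is attained trivially and
\begin{equation*}
    W_{d,1}(\delta_{\boldsymbol{\xi}}, \delta_{\boldsymbol{\xi}^{\prime}}) = d(\boldsymbol{\xi},\boldsymbol{\xi}^{\prime}) = 7.
\end{equation*}
In this sense, $\kappa$ sets the exchange rate between moving a sample in feature space and flipping its label: with $\kappa = 2$, flipping a label costs as much as moving the features by a distance of $2$ in the chosen norm.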