\section{Experimental details on CIFAR10}
\label{sec:app_cifar10}


In this section, we give the experimental details on the CIFAR10-based experiments shown in Figures \ref{fig:teaserplot} and \ref{fig:K_plot}. Moreover, we also conduct similar experiments using different neural network architectures. First, we give the full experimental details and then provide the results of the experiments using the different architectures.

\paragraph{Subsampling CIFAR10}
In all our experiments we subsample CIFAR10 to simulate the low sample size regime. We ensure that all subsampled versions are class-balanced. Hence, if we subsample to $500$ training images, then each class has exactly $50$ images, which are drawn uniformly from the $5000$ training images of the respective class.

\paragraph{Mask perturbation on CIFAR10}
We consider square black-mask perturbations: the attacker can set a patch of size $2 \times 2$ in the image to zero. The attack is a simplification of the patch attack considered in \cite{Wu20}. We show an example of a black-mask attack on each of the classes in CIFAR10 in Figure \ref{fig:cifar10_masks}. The mask reduces the information about the class in the image, as it occludes part of the object in the image.

\begin{figure}[!ht]
\centering
  \includegraphics[width=0.8\linewidth]{plotsAistats/cifar10_black_mask_attack.png}
  \caption{We show an example of a mask perturbation for all $10$ classes of CIFAR10. Even though the attack occludes part of the images, a human can still easily classify all images correctly.}
\label{fig:cifar10_masks}
\end{figure}

At test time, we evaluate the attack exactly by means of a full grid search over all possible windows. Note that a full grid search requires $900$ forward passes to evaluate one image, which is computationally too expensive at training time. Therefore, we use the same approximation as in \cite{Wu20} at training time. For each image in the training batch, we compute the gradient of the loss with respect to the input. Using that gradient, which is a tensor in $\mathbb{R}^{3 \times 32 \times 32}$, we compute the $l_1$-norm of each patch by a full grid search and save the upper left coordinates of the $K$ windows with the largest $l_1$-norm. The intuition is that windows with high $l_1$-norm are more likely to change the prediction. Out of the $K$ identified candidate windows, we select the one that increases the loss the most by evaluating all $K$ candidates.
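
As a sketch of the training-time approximation described above, and under the assumption that the $l_1$-norm of a window is taken over all three color channels, the procedure can be written as follows. The symbols $g$, $s(i,j)$, $\mathcal{W}_K$, $\mathcal{L}$ and $m_{(i,j)}$ are notation introduced only for this illustration. Let $g = \partial_x \mathcal{L}(f_\theta(x), y) \in \mathbb{R}^{3 \times 32 \times 32}$ be the gradient of the loss $\mathcal{L}$ with respect to the input, and score the window with upper left coordinates $(i,j)$ by
\begin{equation*}
    s(i,j) = \sum_{c=1}^{3} \sum_{a=0}^{1}\sum_{b=0}^{1} \left|g_{c,\, i+a,\, j+b}\right|.
\end{equation*}
With $\mathcal{W}_K$ denoting the set of the $K$ windows with the largest score and $m_{(i,j)} \in \{0,1\}^{3 \times 32 \times 32}$ the binary mask that is zero exactly on the window at $(i,j)$, the training perturbation is then given by the candidate
\begin{equation*}
    (i^\star, j^\star) = \text{arg}\max_{(i,j) \in \mathcal{W}_K} \mathcal{L}\left(f_\theta(x \odot m_{(i,j)}), y\right),
\end{equation*}
which requires only $K$ forward passes per image to compare the candidates, instead of one forward pass per window.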

\begin{wrapfigure}{r}{0.4\textwidth}
\includegraphics[width=0.99\linewidth]{plotsAistats/K_plot_cifar.png}
\caption{We plot the standard error, robust error and susceptibility for varying attack strengths $K$. We see that the larger $K$, the lower the susceptibility, but the higher the standard error.}
\label{fig:K_plot}
\end{wrapfigure}

\paragraph{Experimental training details}
For all our experiments on CIFAR10, we adjusted the code provided by \cite{Phan21}. As typically done for CIFAR10, we augment the data with random cropping and horizontal flipping. For the experiments with results depicted in Figures \ref{fig:teaserplot} and \ref{fig:K_plot}, we use a ResNet18 network and train for $100$ epochs. We tune the learning rate and weight decay for low robust error. For standard training, we use a learning rate of $0.01$ and an equal weight decay. For adversarial training, we use a learning rate of $0.015$ and a weight decay of $10^{-4}$. We run each experiment three times for every dataset with different initialization seeds, and plot the average and standard deviation over the runs.

For the experiments in Figures \ref{fig:teaserplot} and \ref{fig:num_obs_CIFAR}, we use an attack strength of $K = 4$. Recall that we perform a full grid search at test time and hence have a good approximation of the robust accuracy and susceptibility score.

\paragraph{Increasing training attack strength} We investigate the influence of the attack strength $K$ on the robust error for adversarial training. We take $\eps_{\text{tr}} = 2$ and $n = 500$ and vary $K$. The results are depicted in Figure \ref{fig:K_plot}. We see that for increasing $K$, the susceptibility decreases, but the standard error increases more severely, resulting in an increasing robust error. 


\paragraph{Robust error decomposition}
In Figure \ref{fig:teaserplot}, we see that the robust error increases for adversarial training compared to standard training in the low sample size regime, but the opposite holds when enough samples are available. For completeness, we provide a full decomposition of the robust error into standard error and susceptibility for standard and adversarial training. We plot the decomposition in Figure \ref{fig:num_obs_CIFAR}.

\begin{figure*}[!b]
\centering
\begin{subfigure}[b]{0.32\textwidth}
  \includegraphics[width=0.99\linewidth]{plotsAistats/cifar10_robust_numobs.png}
  \caption{Robust error}
  \label{fig:RA_CIFAR_10_n}
\end{subfigure}
\begin{subfigure}[b]{0.32\textwidth}
  \includegraphics[width=0.99\linewidth]{plotsAistats/cifar10_standard_numobs.png}
  \caption{Standard error}
  \label{fig:SA_CIFAR_10_n}
\end{subfigure}
\begin{subfigure}[b]{0.32\textwidth}
  \includegraphics[width=0.99\linewidth]{plotsAistats/cifar10_sus_numobs.png}
  \caption{Susceptibility}
  \label{fig:Robustness_n}
\end{subfigure}

\caption{We plot the standard error, robust error and susceptibility of the subsampled datasets of
CIFAR10 after adversarial and standard training. For small sample sizes, adversarial
training has a higher robust error than standard training. We see that the balance between the increase in standard error and the drop in susceptibility, when moving from standard to adversarial training, switches between the low and high sample size regimes.}
\label{fig:num_obs_CIFAR}
\end{figure*}

\paragraph{Multiple networks on CIFAR10}
We run adversarial training for multiple network architectures on subsampled CIFAR10 ($n=500$) with mask perturbations of size $2 \times 2$ and an attack strength of $K=4$. We report the results in Table \ref{CIFAR10_diffArchitectures}. For all the different architectures, we notice a similar increase in robust error when trained with adversarial training instead of standard training.

\begin{table}[!ht]
\centering
\caption{We subsample CIFAR10 to a dataset of sample size $500$ and perform both standard training (ST) and adversarial training (AT) using different networks. We evaluate the resulting susceptibility score and the robust and standard error. }
\begin{tabular}{ |p{2cm}||p{2cm}||p{1cm}||p{1cm}|p{2cm}|p{2cm}|p{2cm}|}
 \hline
 \multicolumn{7}{|c|}{Adversarial training on CIFAR10} \\
 \hline
Architecture & Learning rate & Weight decay & Train type & Standard error & Robust error & Susceptibility\\
 \hline
 ResNet34 &   $ 0.02$  & $0.025$ &   ST  & 44 & 64 & 50 \\
 ResNet34 &   $0.015$  & $10^{-4}$ &   AT & 52 & 66 & 40\\
 ResNet50 &  $0.015$  & $0.03$  &   ST &  45 & 62 & 47\\
 ResNet50 &  $0.015$  &  $10^{-4}$ &   AT &  53 & 68 & 45\\
VGG11bn &  $0.03$ & $0.01$ & ST & 40 & 55 & 43\\
VGG11bn &   $0.015$  & $10^{-4}$ & AT & 48 &63 & 34\\
VGG16bn &  $0.02$ & $0.01$ & ST & 41 & 60 & 48\\
VGG16bn &   $0.015$  & $10^{-4}$ & AT & 50 & 65  & 42\\
 \hline
\end{tabular}
\label{CIFAR10_diffArchitectures}
\end{table}



\section{Static hand gesture recognition}
\label{sec:handgestures}

The goal of static hand gesture or posture recognition is to recognize hand gestures such as a pointing index finger or the okay-sign based on static data such as images \cite{Oudah20, Yang13}. The current use of hand gesture recognition is primarily in the interaction between computers and humans \cite{Oudah20}. More specifically, typical practical applications can be found in the environment of games, assisted living, and virtual reality \cite{Mujahid21}. In the following, we conduct experiments on a hand gesture recognition dataset constructed by \cite{Mantecon19}, which consists of near-infrared stereo images obtained using the Leap Motion device. First, we crop or segment the images after which we use logistic regression for classification. We see that adversarial logistic regression deteriorates robust generalization with increasing $\eps_{\text{tr}}$.

\paragraph{Static hand-gesture dataset}
We use the dataset made available by \cite{Mantecon19}. This dataset consists of near-infrared stereo images taken with the Leap Motion device and provides detailed skeleton data. We base our analysis on the images only. The size of the images is $640 \times 240$ pixels. The dataset consists of $16$ classes of hand poses taken by $25$ different people. We note that the variety between the different people is relatively wide; there are men and women with different posture and hand sizes. However, the different samples taken by the same person are alike.

We consider binary classification between the index-pose and L-pose, and take as a training set $30$ images of each of the users $16$ to $25$. This results in a training dataset of $300$ samples. We show two examples of the training dataset in Figure \ref{fig:original_examples}, each corresponding to a different class. Observe that the near-infrared images darken the background and successfully highlight the hand-pose. As a test dataset, we take $10$ images of each of the two classes from each of the users $1$ to $10$, resulting in a test dataset of size $200$.

\begin{figure}
    \centering
    \begin{subfigure}{0.49\textwidth}
    \includegraphics[width=.80\linewidth]{plotsAistats/Lpose.png}
    \caption{L pose}
    \label{fig:L_pose_or_example}
    \end{subfigure}
    \begin{subfigure}{0.49\textwidth}
    \includegraphics[width=.80\linewidth]{plotsAistats/Indexpose.png}
    \caption{Index pose}
    \label{fig:index_pose_or_example}
    \end{subfigure}
    \caption{We show two images, one for each of the two classes. We recognize the ``L''-sign in Figure \ref{fig:L_pose_or_example} and the index sign in Figure \ref{fig:index_pose_or_example}. Observe that the near-infrared images highlight the hand pose well and blend out much of the non-useful or noisy background.}
\label{fig:original_examples}
\end{figure}

\paragraph{Cropping the dataset}
To speed up training and ease the classification problem, we crop the images from a size of $640 \times 240$ to a size of $200 \times 200$. We crop the images using a basic image segmentation technique to stay as close as possible to real-world applications. The aim is to crop the images such that the hand gesture is centered within the cropped image.

For every user in the training set, we crop an image of the L-pose and the index pose by hand. We call these images the training masks $\{\text{masks}_i \}_{i=1}^{20}$. We note that the more a particular window of an image resembles a mask, the more likely it is that the window captures the hand gesture correctly. Moreover, the near-infrared images are such that the hands of a person are brighter than the surroundings of the person. Based on these two observations, we define the best segment or window, defined by the upper left coordinates $(i,j)$, for an image $x$ as the solution to the following optimization problem:

\begin{equation}
\label{preprocessing}
    \argmin_{i \in [440], \Hquad j \in [40]} \sum_{l=1}^{20}\|\text{masks}_l-x_{\{i:i+200,j:j+200\}}\|^2_2 - \frac{1}{2}\|x_{\{i+w,j+h\}}\|_1.
\end{equation}
Equation \ref{preprocessing} is solved using a full grid search. We use the result to crop both training and test images. Upon manual inspection of the cropped images, nearly all images were cropped correctly. We replace the handful of poorly cropped training images with hand-cropped counterparts.

\begin{figure}[!ht]
\centering
\begin{subfigure}{0.31\textwidth}
    \centering
    \includegraphics[width=.80\linewidth]{plotsAistats/L_147.png}
    \caption{Cropped L pose}
    \label{fig:cropped_L}
\end{subfigure}
\begin{subfigure}{0.31\textwidth}
    \centering
    \includegraphics[width=.80\linewidth]{plotsAistats/index_28.png}
    \caption{Cropped index pose}
    \label{fig:cropped_index}
\end{subfigure}
\begin{subfigure}{0.31\textwidth}
    \centering
    \includegraphics[width=.80\linewidth]{plotsAistats/L_pose_with_mask.png}
    \caption{Black-mask perturbation}
    \label{fig:cropped_L_mask}
\end{subfigure}
    \caption{In Figures \ref{fig:cropped_L} and \ref{fig:cropped_index} we show examples of the images cropped using Equation \ref{preprocessing}. We see that the hands are centered and the images have a size of $200 \times 200$. In Figure \ref{fig:cropped_L_mask} we show an example of the square black-mask perturbation.}
    \label{fig:preprocessing}
\end{figure}

\paragraph{Square-mask perturbations}
 Since we use logistic regression, we perform a full grid search to find the best adversarial perturbation at training and test time. For completeness, the upper left coordinates of the optimal black-mask perturbation of size $\eps_{\text{tr}} \times \eps_{\text{tr}}$ can be found as the solution to
\begin{equation}
\label{square_perturbations_logistic_regression}
    \text{arg}\max_{i \in [200-\eps_{\text{tr}}], \Hquad j \in [200-\eps_{\text{tr}}]} \sum_{l,m \in [\eps_{\text{tr}}]}\theta_{[i:i+l,j:j+m]}.
\end{equation}
The algorithm is rather slow as we iterate over all possible windows. We show a black-mask perturbation on an $L$-pose image in Figure \ref{fig:cropped_L_mask}.
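
To make the role of the classifier weights in this search concrete, the following sketch works out the effect of a mask on a linear classifier. It assumes labels $y \in \{\pm 1\}$, the prediction rule $\text{sign}(\theta^{\top} x)$, and a loss that is decreasing in the margin $y\,\theta^{\top} x$, as is the case for the logistic loss. The window notation $\mathcal{W}_{i,j}$ and the masked image $x^{(i,j)}$ are introduced only for this illustration. Let $\mathcal{W}_{i,j}$ be the set of pixels covered by the mask with upper left coordinates $(i,j)$ and let $x^{(i,j)}$ be the image $x$ with these pixels set to zero. Then
\begin{equation*}
    y\,\theta^{\top} x^{(i,j)} = y\,\theta^{\top} x - y\sum_{k \in \mathcal{W}_{i,j}} \theta_{[k]}\, x_{[k]},
\end{equation*}
so that, under these assumptions, the loss-maximizing window for a sample $(x,y)$ is the one that maximizes $y\sum_{k \in \mathcal{W}_{i,j}} \theta_{[k]}\, x_{[k]}$. Each candidate window can therefore be scored with a sum over $\eps_{\text{tr}}^2$ entries, without recomputing the full inner product.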

\paragraph{Results} We run adversarial logistic regression with square-mask perturbations on the cropped dataset for varying adversarial training budgets and plot the result in Figure \ref{fig:eps_mask}. We observe that adversarial logistic regression deteriorates robust generalization.

Because we use adversarial logistic regression, we are able to visualize the classifier. Given the classifier induced by $\theta$, we can visualize how it classifies the images by plotting $\frac{\theta - \min_{i \in [d]}\theta_{[i]}}{\max_{i \in [d]}\theta_{[i]} - \min_{i \in [d]}\theta_{[i]}} \in [0,1]^{d}$. Recall that the class-prediction of our predictor for a data point $(x,y)$ is given by $\text{sign}(\theta^{\top} x) \in \{\pm 1\}$. The lighter parts of the resulting image correspond to the class with label $1$ and the darker patches to the class with label $-1$.

\begin{wrapfigure}{r}{0.4\textwidth}
\includegraphics[width=0.99\linewidth]{plotsAistats/mask_plot_main.png}
\caption{We plot the standard error and robust error for varying adversarial training budget $\eps_{\text{tr}}$. We see that the larger $\eps_{\text{tr}}$ the higher the robust error.}
\label{fig:eps_mask}
\end{wrapfigure}

We plot the classifiers obtained by standard logistic regression and adversarial logistic regression with training adversarial budgets $\eps_{\text{tr}}$ of $10$ and $25$ in Figure \ref{fig:visulation_log}. The darker parts in the classifier correspond to patches that are typically bright for the $L$-pose. Conversely, the lighter patches in the classifier correspond to patches that are typically bright for the index pose. We see that in the case of adversarial logistic regression, the background noise is much higher than for standard logistic regression. In other words, adversarial logistic regression puts more weight on non-signal parts in the images to classify the training dataset and hence exhibits worse performance on the test dataset.
 
 \newpage
\begin{figure}[!ht]
\centering
\begin{subfigure}{0.31\textwidth}
    \centering
    \includegraphics[width=.80\linewidth]{plotsAistats/natural_log_regr_result.png}
    \caption{$\eps_{\text{tr}} = 0 $}
    \label{fig:log_natural}
\end{subfigure}
\begin{subfigure}{0.31\textwidth}
    \centering
    \includegraphics[width=.80\linewidth]{plotsAistats/perTrain10logisticReg.png}
    \caption{$\eps_{\text{tr}} = 10 $}
    \label{fig:log_e10}
\end{subfigure}
\begin{subfigure}{0.31\textwidth}
    \centering
    \includegraphics[width=.80\linewidth]{plotsAistats/perTrain25logisticReg.png}
    \caption{$\eps_{\text{tr}} = 25$}
    \label{fig:log_e25}
\end{subfigure}
    \caption{We visualize the logistic regression solutions. In Figure \ref{fig:log_natural} we plot the vector that induces the classifier obtained after standard training. In Figure \ref{fig:log_e10} and Figure \ref{fig:log_e25} we plot the vector obtained after training with square-mask perturbations of size $10$ and $25$, respectively. In the image projections of the adversarially trained classifiers, we note the enhanced background correlations in non-signal regions, highlighted with red circles.}
    \label{fig:visulation_log}
\end{figure}

\section{Adversarial training hurts robust generalization for nonlinear feature learning}
\label{sec:app_theorycs}

In this section, we give a mathematical explanation for the effect of
adversarial training with directed attacks\xspace increasing the robust error
for nonlinear feature learning models. In particular, we construct a
dataset, the concentric spheres dataset, that has exactly one discriminative
feature: the norm of the datapoints. Figure \ref{fig:app_cs_repeat}
and Figure \ref{fig:cs_numsamp_rob} show that the behaviour of the
feature learning model on our synthetic setting matches the behaviour
we observe on the linear synthetic \emph{and} on the real-world
datasets: in the low sample size regime, adversarial training
increasingly hurts robust generalization with increasing perturbation
set size.

More concretely, we discuss a two-layer neural network and arrive at the same intuitive explanation as in the linear example. First, we introduce the dataset and model. Then, we discuss some theoretical results. Lastly, we plot and discuss experiments.

\subsection{Problem Setting}
In this subsection, we first introduce the concentric spheres distribution, 2-layer quadratic neural networks and the directed attack\xspace that we consider. Then, we show that the optimal robust classifier is included in our function space.

\paragraph{Distribution for concentric spheres}
We study the concentric spheres distribution as also used in
\cite{gilmer18, kolter19}. In particular, for
$0<\radius_{-1}<\radius_{1}$, we draw $(x,y) \sim \mathbb{P}_{\text{CS}}$ as follows: we draw a binary label $y\in \{+1, -1\}$
equiprobably and a covariate vector $x \in \mathbb{R}^{d}$ that is,
conditional on the label, distributed uniformly on the sphere of
radius $\radius_{y}$.

\paragraph{Perturbation sets}
In this example, the radius of the input corresponds to the signal,
hence, for a training perturbation size $0< \eps_{\text{tr}} <
\frac{\radius_{1}-\radius_{-1}}{2}$ and a covariate $x$ we may define
a directed attack\xspace as an attack out of the perturbation set
\begin{equation}
  \label{eq:pertsetsphere}
    \mathcal{S}_C(x, \eps_{\text{tr}}) = \left\{\delta \in \mathbb{R}^{d} \mid \delta = \frac{x}{\norm{x}_2}\eta, \Hquad |\eta|<\eps_{\text{tr}}\right\}.
\end{equation}
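As a short worked consequence of this definition (the symbol $x'$ is introduced here only for illustration), any perturbed point $x' = x + \delta$ with $\delta \in \mathcal{S}_C(x, \eps_{\text{tr}})$ stays on the line through the origin and $x$:
\begin{equation*}
    x' = \left(1 + \frac{\eta}{\norm{x}_2}\right)x, \qquad \norm{x'}_2 = \left|\norm{x}_2 + \eta\right|.
\end{equation*}
Hence a perturbed point of the inner class has norm smaller than $\radius_{-1} + \eps_{\text{tr}}$ and a perturbed point of the outer class has norm larger than $\radius_{1} - \eps_{\text{tr}}$. The condition $\eps_{\text{tr}} < \frac{\radius_{1}-\radius_{-1}}{2}$ is equivalent to $\radius_{-1} + \eps_{\text{tr}} < \radius_{1} - \eps_{\text{tr}}$, so the two classes remain separable by their norm after perturbation.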

\paragraph{Neural network classifier}
Similar to prior work on concentric spheres such as \cite{gilmer18},
we consider two-layer neural networks with quadratic activations as
our parameterized function class with
\begin{equation*}
    f_\theta(x) = \left(x^T W_1 \right)^2 W_2 + b,
\end{equation*}
where $\theta =(W_1 , W_2,b)$ and $W_1 \in \mathbb{R}^{d \times p}$, $ W_2 \in \mathbb{R}^p$, $b\in \mathbb{R}$. Every function induces a decision boundary defined by
\begin{equation}
  \label{eq:decisionboundary}
  db(f_\theta) = \{x \in \mathbb{R}^{d} \mid f_{\theta}(x) = 0\}.
\end{equation} 
We denote the function space of all such neural networks by
$\mathcal{F}_{\text{QNN}} = \{f_\theta: W_1 \in \mathbb{R}^{d \times p}, W_2 \in
\mathbb{R}^p, b \in \mathbb{R}\}$.



In particular, the function space includes a classifier that minimizes the robust error, as stated in the following lemma.
\begin{lemma}
  If $p>d$, the function space $\mathcal{F}_{\text{QNN}}$ contains a classifier that minimizes the robust error against perturbations~\eqref{eq:pertsetsphere} defined by the distribution $\mathbb{P}_{\text{CS}}$.
\end{lemma}

\begin{proof}
  Clearly, for any consistent $\eps_{\text{te}}$, a classifier whose decision boundary $db\left(f_{\theta}\right)$ is the sphere with radius $R_{opt} = \frac{\radius_{-1}+\radius_{1}}{2}$ is perfectly robust. For a visualization see Figure \ref{fig:teaser_concentric_spheres}. Hence, it suffices to show that $\mathcal{F}_{\text{QNN}}$ includes a function that induces a decision boundary
  that is the sphere with radius $R_{opt}$.
 

 

When $p>d$, choosing
\begin{equation*}
    W_1 = \begin{pmatrix}
I_d & 0
\end{pmatrix},
\end{equation*}
$W_2 = 1_{\{p\}}$ and $b = -R_{opt}^2 = -\frac{(\radius_{-1}+\radius_{1})^2}{4}$ yields $f_{\theta}(x) = \norm{x}_2^2 - R_{opt}^2$, whose decision boundary $db\left(f_{\theta}\right)$ is the sphere of radius $R_{opt}$.
Note that this is only one particular parameter constellation. In fact, there exist infinitely many $\theta$ that induce the same decision boundary.
\end{proof}
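
The computation behind this construction can be spelled out as follows; this is a sketch for the particular $W_1$ and $W_2$ chosen in the proof and a generic bias $b < 0$. Since $(x^T W_1)_m = x_m$ for $m \le d$ and $(x^T W_1)_m = 0$ for $m > d$, we obtain
\begin{equation*}
    f_{\theta}(x) = \sum_{m=1}^{d} x_m^2 + b = \norm{x}_2^2 + b,
\end{equation*}
so that $db(f_{\theta})$ is the sphere of radius $\sqrt{-b}$. This radius equals $R_{opt}$ exactly when $b = -R_{opt}^2 = -\frac{(\radius_{-1}+\radius_{1})^2}{4}$.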




\subsection{Geometric characterization of the two layer quadratic neural network}


The shape of the decision boundary indicates which features the classifier uses: an ellipsoidal decision boundary relies primarily on the signal, that is, the norm of the datapoints, whereas a hyperboloidal decision boundary also relies on angular information, which is a useless feature for this task.

In our experiments, we observe that adversarial training
learns networks with hyperboloids as decision boundaries. In
contrast, standard training leads to ellipsoids.
This explains why the phenomenon also appears in the experiments
on the concentric spheres dataset.

In this subsection, we describe how to quantify the
``hyperboloidity'' of a learned classifier, which we plot
as a function of $\eps_{\text{tr}}$.


\paragraph{Decision boundary of a two-layer quadratic network.} To ease the flow of the text, we state here a lemma, close to the computation made in \cite{gilmer18}, that brings the quadratic neural network into a classical quadratic form. We provide the proof in Subsection~\ref{subsec:proof_lemma}.
\begin{lemma}
\label{lem:quadratic_symm_matrix}
For any 2-layer quadratic neural network with $p>d$, there exists a real symmetric matrix $A \in \mathbb{R}^{d \times d}$ such that 
\begin{equation}
f_{\theta}(x) = x^{\top} A x + b,
\end{equation}
for any $x \in \mathbb{R}^{d}$.
\end{lemma}


Let $A, b$ be the characterization of a two-layer quadratic neural
network as per Lemma \ref{lem:quadratic_symm_matrix}. Then, recalling
the definition of the decision boundary~\eqref{eq:decisionboundary}
induced by $f_\theta$ and assuming $b \neq 0$, we can define $A_{db}= -\frac{A}{b}$ such that
\begin{equation}
\label{db_quadrnetwork}
db(f_\theta) = \{ x \in \mathbb{R}^{d} \mid x^{\top} A_{db} x = 1 \},
\end{equation}
where we note that $A_{db}$ is a real symmetric matrix.
\begin{fact}
  Let $\lambda^{\theta}$ be the vector whose entries are the eigenvalues of
  $A_{db}$ induced by $f_{\theta}$. If $\lambda^{\theta}_{i} > 0$ for all $i
  \in [d]$, then $db(f_{\theta})$ is an ellipsoid; otherwise,
  $db(f_{\theta})$ is a hyperboloid.
\end{fact}
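
As a small illustration of this fact (the numbers are chosen only for the example), take $d = 2$ and $A_{db} = \text{diag}(\lambda_1, \lambda_2)$, so that the decision boundary is the set of points with $\lambda_1 x_1^2 + \lambda_2 x_2^2 = 1$. For $\lambda_1 = 1$ and $\lambda_2 = \frac{1}{4}$ we obtain the ellipse
\begin{equation*}
    x_1^2 + \frac{x_2^2}{4} = 1
\end{equation*}
with semi-axes of length $1$ and $2$; in general, a positive eigenvalue $\lambda_i$ corresponds to a semi-axis of length $\lambda_i^{-1/2}$. For $\lambda_1 = 1$ and $\lambda_2 = -\frac{1}{4}$ we instead obtain the hyperbola
\begin{equation*}
    x_1^2 - \frac{x_2^2}{4} = 1,
\end{equation*}
which is unbounded and approaches the lines $x_2 = \pm 2 x_1$. Whether a point lies inside or outside this boundary therefore depends on its direction and not only on its norm.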




See Figure
\ref{fig:teaser_concentric_spheres} for a visualization of ellipsoids and hyperboloids.








\paragraph{The dissimilarity score.}

Since we cannot visualize the decision boundaries in high dimensions, we characterize how close the decision boundary is to the truth by calculating a dissimilarity score, defined below.

  Observe that any robust optimal two-layer quadratic neural network has $\lambda^{\theta}_i = \lambda_{opt} = \frac{4}{(\radius_{-1}+\radius_{1})^2}$ for all $i \in [d]$. We define our dissimilarity score as follows
\begin{equation}
    \text{dissim}(f_{\theta}) := \frac{1}{d}\norm{\lambda^{\theta}-1_{\{d\}}\lambda_{opt}}_2.
\end{equation}
We note the following properties of our dissimilarity score:
\begin{enumerate}
    \item $\text{dissim}(f_{\theta})=0$ if and only if $f_{\theta}$ is a perfectly robust classifier.
    \item If $f_{\theta}$ achieves perfect standard accuracy, then  $\text{dissim}(f_{\theta})< \sqrt{\frac{1}{d}}(\lambda_{opt}-\frac{1}{\radius_{1}^2}) = 1.03 \cdot 10^{-3}$. 
    \item Given we classify a training dataset correctly, if $\text{dissim}(f_{\theta}) > \sqrt{\frac{1}{d}}(\lambda_{opt}-\frac{1}{\radius_{1}^2})$, then $db(f_{\theta})$ is necessarily a hyperboloid. Moreover, the larger $\text{dissim}(f_{\theta})$, the more skewed the hyperboloid.
    \item If $\text{dissim}(f_{\theta})$ is large, we have either a stretched-out ellipsoid or a sharp hyperboloid. See Figure \ref{fig:teaser_concentric_spheres} for a visualization of a 2D cut of a hyperboloid and an ellipsoid.
\end{enumerate}
Intuitively, the larger the dissimilarity score, the worse the robust accuracy of the classifier, because the classifier uses more angular information to interpolate the training points.
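
Two short computations may help to read the score; they use only the definitions above, and the bound $\delta$ is a symbol introduced for this illustration. First, a decision boundary that is the sphere of radius $R_{opt} = \frac{\radius_{-1}+\radius_{1}}{2}$ can be written as $x^{\top} A_{db} x = 1$ with $A_{db} = R_{opt}^{-2} I_d$, which gives
\begin{equation*}
    \lambda_{opt} = \frac{1}{R_{opt}^2} = \frac{4}{(\radius_{-1}+\radius_{1})^2}.
\end{equation*}
Second, if every eigenvalue satisfies $|\lambda^{\theta}_i - \lambda_{opt}| \le \delta$, then
\begin{equation*}
    \text{dissim}(f_{\theta}) = \frac{1}{d}\left(\sum_{i=1}^{d} \left(\lambda^{\theta}_i - \lambda_{opt}\right)^2\right)^{1/2} \le \frac{1}{d}\sqrt{d\,\delta^2} = \frac{\delta}{\sqrt{d}},
\end{equation*}
which is consistent with the factor $\sqrt{\frac{1}{d}}$ appearing in the thresholds listed above.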


\begin{figure}[!ht]
\centering
  \includegraphics[width=0.5\linewidth]{plotsAistats/CS_teaser.png}
  \caption{2D cut along the first two dimensions of the concentric spheres
    example for $d=500$, visualizing the decision boundaries obtained via adversarial (left) and standard training (right) of a two-layer network with quadratic activations; the training points are not shown. The learned robust classifier has a hyperbolic decision boundary and uses angle information for classification, whereas the standard classifier perfectly separates the classes.
  }
\label{fig:teaser_concentric_spheres}
\end{figure}


\subsection{Experimental details on concentric spheres example}
\label{sec:app_expcs}

In this section, we further study the concentric spheres example experimentally and give experimental details on Figure \ref{fig:eps_cs}. More precisely, we observe that attack-model overfitting on the concentric spheres dataset is possible for multiple adversarial test perturbation budgets $\eps_{\text{te}}$.

\paragraph{Experimental details to Figure   \ref{fig:eps_cs}}
We sample $5$ datasets of size $n =10^5$ samples with varying dimensions $\dim{} = 350, \Hquad 500$ and $750$ of the concentric spheres distribution with radii $\radius_{-1} = 1$ and $\radius_{1} = 1.3$. The results we plot in Figure \ref{fig:eps_cs} are the average robust accuracies over the $5$ datasets and the shaded areas the respective standard deviations.

For optimization, we use TensorFlow \cite{tensorflow2015-whitepaper} and its Adam optimizer with a learning rate of $0.001$ and a batch size of $10$. We train a two-layer quadratic neural network of width $p = 1000$ for $100$ epochs or until all training points are correctly classified. We implement adversarial training by solving the inner maximization using
\begin{equation}
\label{eq:csATmaximization}
    x' = x-\eps_{\text{tr}}\frac{x}{\norm{x}_2}\text{sign}\left(x^T \partial_x f_{\theta}(x)\right).
\end{equation}
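For the quadratic networks considered here, the effect of this update can be made explicit. The following sketch assumes the form $f_{\theta}(x) = x^{\top} A x + b$ with symmetric $A$ from Lemma \ref{lem:quadratic_symm_matrix} and the case $\eps_{\text{tr}} < \norm{x}_2$; the shorthand $s$ is introduced only for this illustration. Since $\partial_x f_{\theta}(x) = 2Ax$, we have $s := \text{sign}\left(x^T \partial_x f_{\theta}(x)\right) = \text{sign}\left(x^{\top} A x\right)$ and
\begin{equation*}
    x' = \left(1 - \frac{s\,\eps_{\text{tr}}}{\norm{x}_2}\right) x, \qquad f_{\theta}(x') = \left(1 - \frac{s\,\eps_{\text{tr}}}{\norm{x}_2}\right)^2 x^{\top} A x + b.
\end{equation*}
If $s = 1$, the quadratic term is positive and is multiplied by a factor smaller than one; if $s = -1$, the quadratic term is negative and is multiplied by a factor larger than one. In both cases $f_{\theta}(x') < f_{\theta}(x)$, that is, under these assumptions the update moves $x$ radially in the direction that decreases the network output.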

\subsection{Experimental results and discussion}
In this subsection we show the experimental results and discuss their implications.
In all experiments, we adversarially train a two-layer neural network with quadratic
activations and width $1000$ on the concentric spheres dataset with
$\radius_{-1}=1$ and $\radius_{1}=11$. We minimize the cross-entropy loss until
convergence. More experimental details can be found in
Section~\ref{sec:app_expcs}.

\begin{figure}[!t]
\vskip 0.2in
\centering
\begin{subfigure}[b]{0.33\textwidth}
  \includegraphics[width=0.99\linewidth]{plotsAistats/cs_eps_standard_app.png}
  \caption{Standard accuracy}
  \label{fig:app_cs_standard}
\end{subfigure}
\begin{subfigure}[b]{0.33\textwidth}
  \includegraphics[width=0.99\linewidth]{plotsAistats/cs_eps_main.png}
  \caption{Robust accuracy}
  \label{fig:app_cs_repeat}
\end{subfigure}
\begin{subfigure}[b]{0.33\textwidth}
  \includegraphics[width=0.99\linewidth]{plotsAistats/cs_dissim_app.png}
  \caption{Dissimilarity score}
  \label{fig:app_cs_dissim}
\end{subfigure}

\caption{Experiments with the 2-layer quadratic network on the concentric spheres dataset with different adversarial budgets (x-axis). We use an input dimension of $d = 500$. Note that the robust and standard accuracy monotonically decrease for increasing $\eps_{\text{tr}}$. Moreover, the dissimilarity score monotonically increases with increasing $\eps_{\text{tr}}$, meaning that the networks converge to sharper hyperbolic solutions and hence use more angular information to classify the training points. See Subsection \ref{sec:app_expcs} for experimental details.}
\label{fig:app_eps_cs}
\end{figure}

\paragraph{AT hurts robust generalization}


To understand the effect of increasing the adversarial training budget from standard training $(\eps_{\text{tr}} = 0)$ to training with a large adversarial budget, we perform several experiments. We fix the dimension to be $d=500$, choose varying dataset sizes $n = 50, 100$ and $200$ and log the standard accuracy, robust accuracy ($\eps_{\text{te}} = 3$) and dissimilarity score. We plot the results in Figure \ref{fig:app_eps_cs}. Observe that, similar to the linear case and the real-world experiments, the robust accuracy decreases with increasing adversarial training budget $\eps_{\text{tr}}$. Moreover, the aggravating trend is more severe when $\frac{d}{n}$ is large.

\paragraph{Explanation: AT makes DB more hyperboloid}
To understand the change in decision boundaries, we look at the dissimilarity score. In Figure \ref{fig:app_cs_dissim}, we note that the dissimilarity score monotonically increases with increasing $\eps_{\text{tr}}$. In particular, we see that the dissimilarity score is strictly larger than $1.03 \cdot 10^{-3}$ for all $\frac{d}{n}$ when $\eps_{\text{tr}} > 2.5$. By property 3 of the dissimilarity score, listed in the subsection above, this means that for large $\eps_{\text{tr}}$, the decision boundary is a hyperboloid; the classifier uses angular information to interpolate the training dataset. Moreover, the larger the dissimilarity score, the sharper the hyperboloid. For a visualization of a hyperbolic and an ellipsoidal decision boundary, we refer to Figure \ref{fig:teaser_concentric_spheres}.

Lastly, we also investigate the robustness score with varying $\eps_{\text{tr}}$. In Figure \ref{fig:cs_trade_off}, we plot the decomposition after adversarial training with increasing $\eps_{\text{tr}}$ and $n = 50$. Similar to the linear example, plotted in Figure \ref{fig:app_tradeoff_logreg}, we recognize a U-shape. Together with the increasing dissimilarity score, this suggests the following explanation. For large $\eps_{\text{tr}}$, when the dissimilarity score is also large, we converge to networks with sharp hyperboloid decision boundaries. These are highly robust but have low standard accuracy. In contrast, when using standard training ($\eps_{\text{tr}} = 0$) we converge to decision boundaries close to the optimal robust one, which has high standard accuracy and robustness.

In summary, steering away from the optimal decision boundary (increasing $\eps_{\text{tr}}$) at first decreases both standard accuracy and robust accuracy. Thereafter, when the decision boundary is a hyperboloid, increasing $\eps_{\text{tr}}$ further decreases the standard accuracy but increases the robustness. The increase in robustness is the result of converging to a sharper hyperboloid decision boundary, which relies less on the norm of the samples to classify them.

\begin{figure}[!b]
\vskip 0.2in
\centering
\begin{subfigure}[b]{0.49\textwidth}
  \includegraphics[width=0.99\linewidth]{plotsAistats/cs_rob_acc_numsamp_half.png}
  \caption{Robust accuracy}
  \label{fig:numobs}
\end{subfigure}
\begin{subfigure}[b]{0.49\textwidth}
  \includegraphics[width=0.99\linewidth]{plotsAistats/cs_trade_off_decomposition_app.png}
  \caption{Standard-robust accuracy trade-off}
  \label{fig:cs_trade_off}
\end{subfigure}

\caption{ We set $d =  500, \radius_{-1} = 1, \radius_{1} = 11$ and $\eps_{\text{te}} = 3$. (a) Adversarial training on the concentric spheres dataset with increasing sample size. We see that for low sample sizes, adversarial training hurts robust accuracy, but for high sample sizes, we recognize the known regime where it helps robust generalization. (b) Robust accuracy decomposition of adversarial training with increasing perturbation budget $\eps_{\text{tr}}$. For large $\eps_{\text{tr}}$, we note how the robust accuracy decreases, while the robustness increases. The decrease is hence a result of decreasing standard accuracy. See Subsection \ref{sec:app_expcs} for experimental details.}
\label{fig:numobs_trade_off}
\end{figure}

\subsection{Proof of Lemma \ref{lem:quadratic_symm_matrix}}
\label{subsec:proof_lemma}

Let us start by recalling the two-layer quadratic neural network. Such a network is a function $f:\mathbb{R}^d\to\mathbb{R}$ of the form
\begin{equation*}
    f_{\theta}(x) = \left(x^T W_1\right)^2 W_2+b.
\end{equation*}
We rewrite this equation in a quadratic form:
\begin{equation*}
    \begin{split}
        f(x) &= \left(\sum_{i=1}^d x_i W_{1,i}\right)^2 W_2 + b\\
        &= \sum_{m=1}^p\left(\sum_{i=1}^d x_i W_{1,i,m}\right)^2 W_{2,m} + b\\
        &= \sum_{i,j = 1}^d a_{i,j}x_i x_j + b\\
        &= x^T A x + b,
    \end{split}
\end{equation*}
where 
\begin{equation*}
    a_{i,j} = \begin{cases}
    \sum_{m=1}^p W_{1,i,m}^2 W_{2,m} & \text{if } i = j,\\
    \sum_{m=1}^p W_{1,i,m} W_{1,j,m} W_{2,m} & \text{if } i \neq j.
    \end{cases}
\end{equation*}
Hence, the decision boundary of $f$ is described by a quadratic equation, and any two-layer quadratic neural network can be written in the form $f(x) = x^T A x + b$ with a symmetric matrix $A$.
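
As a sanity check of this identity, consider the following small worked instance. The values $d = 2$, $p = 1$, $W_1 = (1, 2)^T$ and $W_2 = 3$ are assumptions chosen purely for illustration. Then
\begin{equation*}
    f(x) = 3\left(x_1 + 2 x_2\right)^2 + b = 3 x_1^2 + 12 x_1 x_2 + 12 x_2^2 + b = x^T \begin{pmatrix} 3 & 6 \\ 6 & 12 \end{pmatrix} x + b,
\end{equation*}
which agrees with $a_{1,1} = 1^2 \cdot 3$, $a_{2,2} = 2^2 \cdot 3$ and $a_{1,2} = a_{2,1} = 1 \cdot 2 \cdot 3$. In matrix notation, the coefficients can be collected as $A = W_1 \operatorname{diag}(W_2) W_1^T$, which makes the symmetry of $A$ explicit.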


\subsection{Attack-model overfitting for multiple adversarial test budgets $\eps_{\text{te}}$}
The choice of $\eps_{\text{te}} = 0.075$ is reasonable but somewhat arbitrary. Hence, we repeat the experiment, with the experimental details given in Section \ref{app_csexpdetails_main}, for a different adversarial test perturbation budget $\eps_{\text{te}}=0.1$ and also report the standard accuracy ($\eps_{\text{te}}=0$). We plot the results of the experiments in Figure \ref{fig:n_d_exp_robust}. Again, we observe attack-model overfitting. 

\section{Bounds on the susceptibility score}
\label{app:susc}
In Theorem \ref{thm:linlinf}, we give non-asymptotic bounds on the robust and standard error of a linear classifier trained with adversarial logistic regression. Moreover, we use the decomposition of the robust error into susceptibility and standard error to gain intuition about how adversarial training may hurt robust generalization. In this section, we complete the result of Theorem \ref{thm:linlinf} by also deriving non-asymptotic bounds on the susceptibility score of the max $\ell_2$-margin classifier.

Using the results in Appendix \ref{sec:app_theorylinear}, we can prove the following Corollary \ref{cor:robustness}, which gives non-asymptotic bounds on the susceptibility score.
\begin{corollary}
\label{cor:robustness}
  Assume $d-1>n$. For the $\eps_{\text{te}}$-susceptibility on test samples from $\mathbb{P}_{r}$ with $2 \eps_{\text{te}} < r$ and the perturbation sets in Equations~\eqref{eq:linfmaxpert} and~\eqref{eq:l1maxpert}, the following holds:

For $\eps_{\text{tr}} < \frac{r}{2} - \tilde{\gamma}_{\max}$, with probability at least $1-2\text{e}^{-\frac{\alpha^2 (d-1)}{2}}$ for any $0<\alpha<1$, over the draw of a dataset $D$ with $n$ samples from $\mathbb{P}_{r}$, the $\eps_{\text{te}}$-susceptibility is upper and lower bounded by
  \begin{equation}
  \begin{split}
       &\suscept{\thetahat{\epstrain}} \leq \Phi \left(\frac{(r-2 \eps_{\text{tr}}) (\eps_{\text{te}} - \frac{r}{2})}{2 \tilde{\gamma}_{\max} \sigma}\right) - \Phi \left( \frac{(r-2 \eps_{\text{tr}})( -\eps_{\text{te}} - \frac{r}{2})}{2 \tilde{\gamma}_{\min}\sigma} \right)\\ 
       &\suscept{\thetahat{\epstrain}} \geq   \Phi \left(\frac{(r-2 \eps_{\text{tr}}) (\eps_{\text{te}} - \frac{r}{2})}{2 \tilde{\gamma}_{\min}\sigma}\right) - \Phi \left( \frac{(r-2 \eps_{\text{tr}})( -\eps_{\text{te}} - \frac{r}{2})}{2 \tilde{\gamma}_{\max} \sigma} \right)
        \end{split}
  \end{equation}
\end{corollary}

We give the proof in Subsection \ref{sec:proof_robust_cor}. Observe that the bounds on the susceptibility score in Corollary \ref{cor:robustness} consist of two terms each, where the second term decreases with $\eps_{\text{tr}}$, but the first term increases. We recognise the following two regimes: the max $\ell_2$-margin classifier is either close to the ground truth $e_1$ or not. Clearly, the ground truth classifier has zero susceptibility and hence classifiers close to the ground truth also have low susceptibility. On the other hand, if the max $\ell_2$-margin classifier is not close to the ground truth, then putting less weight on the first coordinate increases invariance to the perturbations along the first direction. Recall that by Lemma \ref{lem:maxmargin}, increasing $\eps_{\text{tr}}$ decreases the weight on the first coordinate of the max $\ell_2$-margin classifier. Furthermore, in the low sample size regime, we are likely not close to the ground truth. Therefore, the regime where the susceptibility decreases with increasing $\eps_{\text{tr}}$ dominates in the low sample size regime.

To confirm the result of Corollary \ref{cor:robustness}, we plot the mean and standard deviation of the susceptibility score over $5$ independent experiments. The results are depicted in Figure \ref{fig:logreg_robust}. We see that for low standard error, when the classifier is reasonably close to the optimal classifier, the susceptibility increases slightly with increasing adversarial budget. However, increasing the adversarial training budget $\eps_{\text{tr}}$ further causes the susceptibility score to drop greatly. Hence, we can recognize both regimes and validate that, indeed, the second regime dominates in the low sample size setting.




 

 
 \begin{figure*}[!b]
  \centering
\begin{subfigure}[b]{0.4\textwidth}
 \centering
  \includegraphics[width=0.99\linewidth]{plotsAistats/app_susceptibilty.png}
  \caption{Susceptibility score decreases with $\eps_{\text{tr}}$}
  \label{fig:app_robustness}
\end{subfigure}
\begin{subfigure}[b]{0.4\textwidth}
 \centering
  \includegraphics[width=0.99\linewidth]{plotsAistats/logreg_trade_off_plot.png}
  \caption{Robust error decomposition}
  \label{fig:app_tradeoff_logreg}
\end{subfigure}
\caption{We set $r = 6$, $d = 1000$, $n = 50$ and $\eps_{\text{te}} = 2.5$. (a) We plot the average susceptibility score and the standard deviation over 5 independent experiments. Note how the bounds closely predict the susceptibility score. (b) For comparison, we also plot the decomposition of the robust error into susceptibility and standard error. Even though the susceptibility decreases, the robust error increases with increasing adversarial budget $\eps_{\text{tr}}$.}
  \vspace{-0.2in}
\label{fig:logreg_robust}
\end{figure*}

\subsection{Proof of Corollary \ref{cor:robustness}}
\label{sec:proof_robust_cor}
We prove the statement by bounding the robustness of a linear classifier. Recall that the robustness of a classifier is the probability that the classifier does not change its prediction under an adversarial attack. The susceptibility score is then given by 
\begin{equation}
\label{eq:rob_sus}
\suscept{\thetahat{\epstrain}} = 1 - \robness{\thetahat{\epstrain}}.
\end{equation}

The proof idea is as follows: since the perturbations are along the first basis direction, $e_1$, we compute the distance along $e_1$ from a point $(X,Y) \sim \mathbb{P}$ to the decision plane induced by the robust max $\ell_2$-margin classifier $\thetahat{\epstrain}$. Then, we note that the robustness of $\thetahat{\epstrain}$ is given by the probability that this distance is greater than $\eps_{\text{te}}$. Lastly, we use the non-asymptotic bounds of Lemma \ref{lem:boundsmaxmargin}.

Recall that, by Lemma \ref{lem:maxmargin}, the max $\ell_2$-margin classifier is of the form
\begin{equation}
\label{eq:robustmaxmarg}
\thetahat{\epstrain} = \frac{1}{\sqrt{(r-2 \eps_{\text{tr}})^2 + 4 \tilde{\gamma}^{2}}}\left[r-2\eps_{\text{tr}},  2 \tilde{\gamma} \tilde{\theta} \right].
\end{equation}
Let $(X, Y) \sim \mathbb{P}$. The distance along $e_1$ from $X$ to the decision plane induced by $\thetahat{\epstrain}$, $\decplanegen{\thetahat{\epstrain} }$, is given by
\begin{equation*}
d_{e_1}(X, \decplanegen{\thetahat{\epstrain}}) = \left| \indof{X}{1}+ \frac{1}{ \indof{\thetahat{\epstrain}}{1}} \sum_{i=2}^{ d }  \indof{\thetahat{\epstrain}}{i} \indof{X}{i} \right|. 
\end{equation*}
Substituting the expression of $\thetahat{\epstrain}$ in Equation \ref{eq:robustmaxmarg} yields
\begin{equation*}
d_{e_1}(X, \decplanegen{\thetahat{\epstrain}}) = \left| \indof{X}{1} + 2 \tilde{\gamma} \frac{1}{(r-2\eps_{\text{tr}})} \sum_{i=2}^{d}  \indof{\tilde{\theta}}{i-1}   \indof{X}{i} \right|. 
\end{equation*}
Let $N$ be a standard normal random variable. Since $\| \tilde{\theta}\|_2^2 = 1$ by definition and a sum of independent Gaussian random variables is again a Gaussian random variable, we can write 
\begin{equation*}
d_{e_1}(X,\decplanegen{\thetahat{\epstrain}}) = \left| \indof{X}{1} + 2 \tilde{\gamma} \frac{\sigma}{(r-2\eps_{\text{tr}})} N \right|. 
\end{equation*}
The robustness of $\thetahat{\epstrain}$ is given by the probability that $d_{e_1}(X,\decplanegen{\thetahat{\epstrain}}) > \eps_{\text{te}}$. Hence, using that $X_1 = \pm \frac{r}{2}$ with probability $\frac{1}{2}$, we get
\begin{equation}
\label{eq:robustness_form}
\robness{\thetahat{\epstrain}} = P\left[ \frac{r}{2} + 2 \tilde{\gamma} \frac{\sigma}{(r-2\eps_{\text{tr}})}  N > \eps_{\text{te}} \right] + P \left[ \frac{r}{2} + 2 \tilde{\gamma} \frac{\sigma}{(r-2\eps_{\text{tr}})}  N < -\eps_{\text{te}} \right].
\end{equation}
We can rewrite Equation \ref{eq:robustness_form} in the form
\begin{equation*}
\robness{\thetahat{\epstrain}}  = P \left[ N > \frac{(r-2\eps_{\text{tr}}) (\eps_{\text{te}} - \frac{r}{2})}{2 \tilde{\gamma}\sigma} \right] + P \left[  N <  \frac{(r-2\eps_{\text{tr}})( -\eps_{\text{te}} -\frac{ r}{2})}{2 \tilde{\gamma}\sigma} \right].
\end{equation*}
Recall that $N$ is a standard normal random variable and denote by $\Phi$ the cumulative distribution function of the standard normal distribution. By definition of the cumulative distribution function, we find that
\begin{equation*}
\robness{\thetahat{\epstrain}} = 1 - \Phi \left(\frac{(r-2\eps_{\text{tr}}) (\eps_{\text{te}} - \frac{r}{2})}{2 \tilde{\gamma}\sigma}\right) + \Phi \left( \frac{(r-2 \eps_{\text{tr}})( -\eps_{\text{te}} - \frac{r}{2})}{2 \tilde{\gamma}\sigma} \right).
\end{equation*}
Substituting the bounds on $\tilde{\gamma}$ of Lemma \ref{lem:boundsmaxmargin} gives us the non-asymptotic bounds on the robustness score and by Equation \ref{eq:rob_sus} also on the susceptibility score.  
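
To make this last substitution explicit, the following sketch spells out which of the two bounds on $\tilde{\gamma}$ enters each term. The shorthands $c_1$ and $c_2$ are introduced here only for this purpose. Combining the expression for the robustness with Equation \ref{eq:rob_sus} gives
\begin{equation*}
\suscept{\thetahat{\epstrain}} = \Phi \left(\frac{c_1}{\tilde{\gamma}}\right) - \Phi \left(\frac{c_2}{\tilde{\gamma}}\right), \qquad c_1 = \frac{(r-2\eps_{\text{tr}}) (\eps_{\text{te}} - \frac{r}{2})}{2 \sigma}, \qquad c_2 = \frac{(r-2\eps_{\text{tr}}) (-\eps_{\text{te}} - \frac{r}{2})}{2 \sigma}.
\end{equation*}
Under the assumptions $2\eps_{\text{te}} < r$ and $2\eps_{\text{tr}} < r$, both $c_1$ and $c_2$ are negative, so that $\tilde{\gamma} \mapsto \Phi(c_j / \tilde{\gamma})$ is increasing on $\tilde{\gamma} > 0$ for $j \in \{1, 2\}$. Hence, provided that $\tilde{\gamma}_{\min} > 0$ and $\tilde{\gamma}_{\min} \leq \tilde{\gamma} \leq \tilde{\gamma}_{\max}$,
\begin{equation*}
\Phi \left(\frac{c_1}{\tilde{\gamma}_{\min}}\right) - \Phi \left(\frac{c_2}{\tilde{\gamma}_{\max}}\right) \leq \suscept{\thetahat{\epstrain}} \leq \Phi \left(\frac{c_1}{\tilde{\gamma}_{\max}}\right) - \Phi \left(\frac{c_2}{\tilde{\gamma}_{\min}}\right),
\end{equation*}
which are the two bounds stated in Corollary \ref{cor:robustness}.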


\section{Experimental details on the linear model}

\label{sec:logregapp}
In this section, we provide the experimental details for Figures \ref{fig:main_theorem} and \ref{fig:lineartradeoff}.

We implement adversarial logistic regression using stochastic gradient descent with a learning rate of $0.01$. Note that logistic regression converges logarithmically to the robust max $\ell_2$-margin solution. As a consequence of the slow convergence, we train for up to $10^7$ epochs. At both training and test time, we solve $\max_{x_i' \in \pertset{x_i}{\eps_{\text{tr}}}} L(f_\theta(x_i') y_i)$ exactly. Hence, we measure the robust error exactly.
Unless specified otherwise, we set $\sigma= 1$,  $r = 12$ and $\eps_{\text{te}} = 4$. 

\paragraph{Experimental details on Figure \ref{fig:main_theorem}} (a) We draw $5$ datasets with $n= 50$ samples and input dimension $d=1000$ from the distribution $\mathbb{P}$. We then run adversarial logistic regression on all $5$ datasets with adversarial training budgets $\eps_{\text{tr}}$ from $1$ to $5$. To compute the resulting robust error gap of all the obtained classifiers, we use a test set of size $10^{6}$. Lastly, we compute the lower bound given in part 2 of Theorem \ref{thm:linlinf}. (b) We draw $5$ datasets with different sizes $n$ between $50$ and $10^4$. We take an input dimension of $d = 10^4$ and plot the mean and standard deviation of the robust error after adversarial and standard logistic regression over the $5$ samples. (c) We again draw $5$ datasets for each combination of $d$ and $n$ and compute the robust error gap for each dataset.

\paragraph{Experimental details on Figure \ref{fig:lineartradeoff}} For both (a) and (b) we set $d = 1000$, $\eps_{\text{te}} = 4$, and vary the adversarial training budget ($\eps_{\text{tr}}$) from $1$ to $5$. For every combination of $n$ and $\eps_{\text{tr}}$, we draw $10$ datasets and show the average and standard deviation of the resulting robust errors. In (b), we set $n = 50$.









\section{Theoretical statements for the linear model}

\label{sec:app_theorylinear}
Before we present the proof of the theorem, we introduce two lemmas that are of independent interest and are used throughout the proof of Theorem~\ref{thm:linlinf}. Recall that the (standard normalized) maximum-$\ell_2$-margin solution (max-margin solution in short) of a dataset $D =\{(x_i, y_i)\}_{i=1}^n$ is defined as
\begin{equation}
  \label{eq:stdmaxmargin}
  \thetahat{} := \argmax_{\|\theta\|_2\leq 1} \min_{i\in [n]} y_i \theta^\top x_i,
\end{equation}
by simply setting $\eps_{\text{tr}} = 0$ in Equation~\eqref{eq:maxmargin}. The $\ell_2$-margin of $\thetahat{}$ then reads $\min_{i\in[n]} y_i \thetahat{\top} x_i$. Furthermore for a dataset $D = \{(x_i, y_i)\}_{i=1}^n$ we refer to the induced dataset $\widetilde{D}$ as the dataset with covariate vectors stripped of the first element, i.e.
\begin{equation}
  \widetilde{D} = \{(\tilde{x}_i, y_i)\}_{i=1}^n :=  \{ ((x_i)_{[2:d]}, y_i) \}_{i=1}^n, 
\end{equation}
where $(x_i)_{[2:d]}$ refers to the last $d-1$ elements of the vector $x_i$. Furthermore, remember that for any vector $z$, $\indof{z}{j}$ refers to the $j$-th element of $z$ and $e_j$ denotes the $j$-th canonical basis vector.
Further, recall the distribution $\mathbb{P}_r$ as defined in
Section~\ref{logreg_linear_model}: the label $y \in \{+1, -1\}$ is
drawn with equal probability and the covariate vector is sampled as $x
= [y\frac{r}{2}, \tilde{x}]$ where $\tilde{x} \in \mathbb{R}^{d-1}$ is
a random vector drawn from an isotropic normal distribution,
i.e. $\tilde{x} \sim \mathcal{N}(0, \sigma^2 I_{d-1})$. We generally allow
$r$, used to sample the training data, to differ from $r_{\text{test}}$, which is
used during test time.

The following lemma derives a closed-form expression for the normalized max-margin solution for any dataset with fixed separation $r$ in the signal component, and that is linearly separable in the last $d-1$ coordinates with margin $\tilde{\gamma}$.

\begin{lemma}
\label{lem:maxmargin}
Let $D = \{(x_i,y_i)\}_{i=1}^{n}$ be a dataset that
consists of points $(x,y) \in \mathbb{R}^{d}\times\{\pm 1\}$ and
$\xind{1} = y\frac{r}{2}$, i.e. the covariates $x_i$ are
deterministic in their first coordinate given $y_i$ with
separation distance $r$. Furthermore, let the induced dataset
$\widetilde{D}$ also be linearly separable by the normalized
max-$\ell_2$-margin solution $\tilde{\theta}$ with an $\ell_2$-margin 
$\tilde{\gamma}$. Then, the normalized max-margin solution of the
original dataset $D$ is given by
\begin{equation}
\label{eq:lemmaxmargin}
\thetahat{} = \frac{1}{\sqrt{r^2 + 4 \tilde{\gamma}^{2}}}\left[r,  2 \tilde{\gamma} \tilde{\theta} \right].
\end{equation}
Further, the standard accuracy of $\thetahat{}$ for data drawn from $\mathbb{P}_{r_{\text{test}}}$ reads
\begin{equation}
  \label{eq:stdaccmaxmargin}
  \mathbb{P}_{r_{\text{test}}}(Y \thetahat{\top} X > 0) = \Phi\left(
  \frac{r \:r_{\text{test}} }{4\sigma\: \tilde{\gamma}} \right).
\end{equation}
\end{lemma}
The proof can be found in Section~\ref{sec:maxmarginproof}. The next lemma provides high probability upper and lower bounds
for the margin $\tilde{\gamma}$ of $\widetilde{D}$ when $\tilde{x}_i$ are drawn from the normal distribution.
\begin{lemma}
\label{lem:boundsmaxmargin}
Let $\widetilde{D}=\{(\Tilde{x}_i,y_i)\}_{i=1}^{n}$ be a random dataset where $y_i \in \{\pm 1\}$ are uniformly distributed and $\tilde{x}_i \sim \mathcal{N}(0,\sigma^2 I_{d-1})$ for all $i$, and let $\tilde{\gamma}$ be the maximum $\ell_2$-margin, which can be written as
\begin{equation*}
  \tilde{\gamma}= \max_{\|\tilde{\theta}\|_2 \leq 1} \min_{i \in [n]} y_i \tilde{\theta}^{\top} \Tilde{x}_i .
\end{equation*}
Then, for any $t \geq 0$, with probability greater than $1-2e^{-\frac{t^2}{2}}$, we have $\tilde{\gamma}_{\min}(t) \leq \tilde{\gamma} \leq \tilde{\gamma}_{\max}(t)$ where
\begin{align*}
  \label{Crude_bounds_subsequent_maxmar}
  &\tilde{\gamma}_{\max}(t) = \sigma \left( \sqrt{\frac{d-1}{n}} + 1  + \frac{t}{\sqrt{n}}\right), \:\: \tilde{\gamma}_{\min}(t)= \sigma \left( \sqrt{\frac{d-1}{n}} -1 - \frac{t}{\sqrt{n}}\right).
\end{align*}  
\end{lemma}
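
For a sense of scale, the bounds of Lemma~\ref{lem:boundsmaxmargin} can be evaluated for one illustrative configuration. The values $d = 1000$, $n = 50$ and $\sigma = 1$ mirror the setting described in Section~\ref{sec:logregapp}, while $t = 2$ is an arbitrary choice made here for the example. Then $\sqrt{(d-1)/n} = \sqrt{19.98} \approx 4.47$ and $t/\sqrt{n} \approx 0.28$, so that
\begin{equation*}
  \tilde{\gamma}_{\max}(2) \approx 4.47 + 1 + 0.28 = 5.75, \qquad \tilde{\gamma}_{\min}(2) \approx 4.47 - 1 - 0.28 = 3.19,
\end{equation*}
and the lemma guarantees that $\tilde{\gamma}$ lies between these two values with probability greater than $1 - 2e^{-2} \approx 0.73$.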





\subsection{Proof of Theorem~\ref{thm:linlinf}}
\label{sec:thmproof}








Given a dataset $D = \{(x_i, y_i)\}$ drawn from $\mathbb{P}_r$, it is easy to see that the (normalized) $\eps_{\text{tr}}$-robust max-margin solution~\eqref{eq:maxmargin} of $D$ with respect to signal-attacking perturbations $\pertset{x_i}{\eps_{\text{tr}}}$ as defined in Equation~\eqref{eq:linfmaxpert} can be written as
\begin{equation}
\begin{aligned}
  \label{eq:robmaxmargin}
  \thetahat{\eps_{\text{tr}}} &= \argmax_{\|\theta\|_2\leq 1}  \min_{i\in [n], x_i' \in \pertset{x_i}{\eps_{\text{tr}}}} y_i \theta^\top x'_i \\
  &= \argmax_{\|\theta\|_2\leq 1}\min_{i\in [n],|\beta|\leq \eps_{\text{tr}}}y_i \theta^\top (x_i + \beta e_1) \nonumber\\
  &= \argmax_{\|\theta\|_2\leq 1} \min_{i\in [n]} y_i \theta^\top (x_i - y_i \eps_{\text{tr}} \sign(\thetaind{1}) e_1). \nonumber
\end{aligned}
\end{equation}
Note that by definition, it is equivalent to the (standard normalized)
max-margin solution $\thetahat{}$ of the shifted dataset ${D_{\epstrain} =
  \{(x_i - y_i \eps_{\text{tr}} \sign(\thetaind{1}) e_1,
  y_i)\}_{i=1}^n}$. Since $D_{\epstrain}$ satisfies the assumptions of
Lemma~\ref{lem:maxmargin}, it then follows directly that the
normalized $\eps_{\text{tr}}$-robust max-margin solution reads
\begin{equation}
  \label{eq:appmaxmargin}
  \thetahat{\eps_{\text{tr}}} = \frac{1}{\sqrt{(r -2\eps_{\text{tr}})^2 + 4 \tilde{\gamma}^{2}}}\left[r-2\eps_{\text{tr}},  2 \tilde{\gamma} \tilde{\theta} \right],
\end{equation}
by replacing $r$ by $r - 2\eps_{\text{tr}}$ in
Equation~\eqref{eq:lemmaxmargin}. Similar to above, $\tilde{\theta} \in
\mathbb{R}^{d-1}$ is the (standard normalized) max-margin solution of
$\{(\tilde{x}_i, y_i)\}_{i=1}^n$ and $\tilde{\gamma}$ the corresponding
margin.
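
The last equality in Equation~\eqref{eq:robmaxmargin} can be checked directly. The short computation below is a sketch that only uses that the perturbation acts on the first coordinate. For fixed $\theta$ and $i$,
\begin{equation*}
  \min_{|\beta|\leq \eps_{\text{tr}}} y_i \theta^\top (x_i + \beta e_1) = y_i \theta^\top x_i + \min_{|\beta|\leq \eps_{\text{tr}}} \beta\, y_i \thetaind{1} = y_i \theta^\top x_i - \eps_{\text{tr}} |\thetaind{1}| = y_i \theta^\top (x_i - y_i \eps_{\text{tr}} \sign(\thetaind{1}) e_1),
\end{equation*}
where the minimum is attained at $\beta = -\eps_{\text{tr}}\, y_i \sign(\thetaind{1})$. If $\thetaind{1} > 0$, the first coordinate of the shifted covariate equals $y_i \left(\frac{r}{2} - \eps_{\text{tr}}\right)$, which corresponds to a separation distance of $r - 2\eps_{\text{tr}}$.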

\paragraph{Proof of 1.}
We can now compute the $\eps_{\text{te}}$-robust accuracy of the
$\eps_{\text{tr}}$-robust max-margin estimator $\thetahat{\eps_{\text{tr}}}$ for a
given dataset $D$ as a function of $\tilde{\gamma}$. Note that in
the expression of $\thetahat{\eps_{\text{tr}}}$, all values are fixed for a
fixed dataset, while $0\leq \eps_{\text{tr}}\leq r-2\tilde{\gamma}_{\max}$ can be chosen.
First note that for a test distribution $\mathbb{P}_r$, the
$\eps_{\text{te}}$-robust accuracy, defined as one minus the robust error (Equation~\eqref{eq:roberr}), for a classifier
associated with a vector $\theta$, can be written as
\begin{align}
  \label{eq:robacc_closed}
  \robacc{\theta} &= \mathbb{E}_{X,Y\sim \mathbb{P}_r} \left[\Indi{\min_{x'
        \in \pertset{X}{\eps_{\text{te}}}} Y \theta^\top x'>0}\right] \\
  &=   \mathbb{E}_{X,Y\sim \mathbb{P}_{r}} \left[ \Indi{ Y \theta^\top X -
      \eps_{\text{te}} |\thetaind{1}| >0}\right] = \mathbb{E}_{X,Y\sim \mathbb{P}_{r}}
  \left[\Indi{ Y \theta^\top (X - Y\eps_{\text{te}} \sign(\thetaind{1}) e_1) >0}\right]
  \nonumber
\end{align}
Now, recall that
by Equation~\eqref{eq:appmaxmargin} and the assumption in the
theorem, we have $r-2\eps_{\text{tr}}>0$, so that $\sign(\indof{\thetahat{\eps_{\text{tr}}}}{1})=1$.
Further, using the definition of the perturbation set in
Equation~\eqref{eq:linfmaxpert} and the definition of the
distribution $\mathbb{P}_r$, we have $\indof{X}{1} = Y
\frac{r}{2}$.
Plugging into Equation~\eqref{eq:robacc_closed} then yields
\begin{align*}
  \robacc{\thetahat{\eps_{\text{tr}}}}&= \mathbb{E}_{X,Y\sim \mathbb{P}_{r}} \left[\Indi{ Y \thetahat{\eps_{\text{tr}} \top} (X - Y\eps_{\text{te}}  e_1) >0}\right] \\
  &=   \mathbb{E}_{X,Y\sim \mathbb{P}_{r}}\left[\Indi{ Y \thetahat{\eps_{\text{tr}} \top} (X_{-1} + Y\left(\frac{r}{2} - \eps_{\text{te}}\right)  e_1) >0}\right] \\
  &= \mathbb{P}_{r- 2 \eps_{\text{te}}} (Y\thetahat{\eps_{\text{tr}} \top} X >0 )
\end{align*}
where $X_{-1}$ is a shorthand for the random vector $X_{-1} = (0;
  \indof{X}{2}, \dots, \indof{X}{d})$.  The assumptions in
Lemma~\ref{lem:maxmargin} ($D_{\epstrain}$ is linearly separable) are
satisfied whenever the $n<d-1$ samples are distinct, i.e. with
probability one. Hence applying Lemma~\ref{lem:maxmargin} with
$r_{\text{test}} = r - 2\eps_{\text{te}}$ and with $r$ replaced by $r -
2\eps_{\text{tr}}$ yields
\begin{equation}
  \label{eq:arsenal}
  \robacc{\thetahat{\eps_{\text{tr}}}} =
  \Phi\left(\frac{r(r-2\eps_{\text{te}})}{4\sigma \tilde{\gamma}}
  - \eps_{\text{tr}} \frac{r-2\eps_{\text{te}}}{2\sigma \tilde{\gamma}}\right).
\end{equation}
The first statement of the theorem then follows by noting that
$\Phi$ is monotonically increasing and that, whenever $r - 2\eps_{\text{te}} > 0$, its argument in Equation~\eqref{eq:arsenal} is decreasing in $\eps_{\text{tr}}$.
The expression for the robust error then follows by noting that $1-\Phi(-z) = \Phi(z)$ for any $z \in \mathbb{R}$
and defining
\begin{equation}
  \label{eq:varphidef}
  \tilde{\varphi} = \frac{\sigma \tilde{\gamma}}{r/2 - \eps_{\text{te}}}.
\end{equation}
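
For completeness, the resulting expressions can be written out as follows. This is only a rearrangement of Equation~\eqref{eq:arsenal} using the definition in Equation~\eqref{eq:varphidef}. Since
\begin{equation*}
  \frac{r(r-2\eps_{\text{te}})}{4\sigma \tilde{\gamma}} - \eps_{\text{tr}} \frac{r-2\eps_{\text{te}}}{2\sigma \tilde{\gamma}} = \frac{(r/2-\eps_{\text{tr}})(r/2-\eps_{\text{te}})}{\sigma \tilde{\gamma}} = \frac{r/2-\eps_{\text{tr}}}{\tilde{\varphi}},
\end{equation*}
we obtain
\begin{equation*}
  \robacc{\thetahat{\eps_{\text{tr}}}} = \Phi\left(\frac{r/2-\eps_{\text{tr}}}{\tilde{\varphi}}\right), \qquad \roberr{\thetahat{\eps_{\text{tr}}}} = 1 - \Phi\left(\frac{r/2-\eps_{\text{tr}}}{\tilde{\varphi}}\right) = \Phi\left(\frac{\eps_{\text{tr}} - r/2}{\tilde{\varphi}}\right).
\end{equation*}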


\paragraph{Proof of 2.}
First define $\varphi_{\text{min}}, \varphi_{\text{max}}$ using $\tilde{\gamma}_{\min}, \tilde{\gamma}_{\max}$ as in Equation~\eqref{eq:varphidef}. Then we have by Equation~\eqref{eq:arsenal}
\begin{align*}
  \roberr{\thetahat{\eps_{\text{tr}}}} - \roberr{\thetahat{0}} &= \robacc{\thetahat{0}} - \robacc{\thetahat{\eps_{\text{tr}}}}\\
  &=   \Phi\left(\frac{r/2}{\tilde{\varphi}}\right) - \Phi\left(\frac{r/2 - \eps_{\text{tr}}}{\tilde{\varphi}}\right)\\
  &= \int_{r/2-\eps_{\text{tr}}}^{r/2} \frac{1}{\sqrt{2\pi}\tilde{\varphi}} \text{e}^{- \frac{x^2 }{\tilde{\varphi}^2}} d x
\end{align*}


By plugging in $t = \sqrt{2 \log (2/\delta)}$ in
Lemma~\ref{lem:boundsmaxmargin}, we obtain that with probability at
least $1-\delta$ we have
\begin{equation*}
   \tilde{\gamma}_{\min} := \sigma 
                \left[\sqrt{\frac{d-1}{n}} - \left(1+\sqrt{\frac{2 \log (2/\delta)}{n}}\right)\right] \leq \tilde{\gamma} \leq \sigma 
                \left[\sqrt{\frac{d-1}{n}} + \left(1+\sqrt{\frac{2 \log (2/\delta)}{n}}\right)\right] =: \tilde{\gamma}_{\max}
\end{equation*}
and equivalently $\varphi_{\text{min}} \leq \tilde{\varphi} \leq \varphi_{\text{max}}$.

Now note the general fact that for all
$\tilde{\varphi} \leq \sqrt{2} x$ the density function  
$f(\tilde{\varphi}; x) = \frac{1}{\sqrt{2\pi}\tilde{\varphi}} \text{e}^{- \frac{x^2 }{\tilde{\varphi}^2}} $
is monotonically increasing in $\tilde{\varphi}$.

By assumption of the theorem, $\tilde{\varphi} \leq \sqrt{2} (r/2-\eps_{\text{tr}})(r/2-\eps_{\text{te}})$ so that $f(\tilde{\varphi}; x) \geq f(\varphi_{\text{min}};x)$ for all $x\in [r/2-\eps_{\text{tr}},r/2]$ and therefore
\begin{equation*}
   \int_{r/2-\eps_{\text{tr}}}^{r/2} \frac{1}{\sqrt{2\pi}\tilde{\varphi}} \text{e}^{- \frac{x^2 }{\tilde{\varphi}^2}} d x \geq  \int_{r/2-\eps_{\text{tr}}}^{r/2} \frac{1}{\sqrt{2\pi}\varphi_{\text{min}}} \text{e}^{- \frac{x^2 }{\varphi_{\text{min}}^2}} d x = \Phi\left(\frac{r/2}{\varphi_{\text{min}}}\right) - \Phi\left(\frac{r/2-\eps_{\text{tr}}}{\varphi_{\text{min}}}\right),
\end{equation*}
and the statement is proved.





\subsection{Proof of Corollary~\ref{cor:l1extension}}
We now show that Theorem~\ref{thm:linlinf} also holds for
$\ell_1$-ball perturbations of radius at most $\epsilon$.  Following
similar steps as in Equation~\eqref{eq:appmaxmargin}, the
$\eps_{\text{tr}}$-robust max-margin solution for $\ell_1$-perturbations can
be written as
\begin{equation}
  \label{eq:maxmarginl1}
  \thetahat{\eps_{\text{tr}}} := \argmax_{\|\theta\|_2 \leq 1}\min_{i\in [n]}  y_i \theta^\top (x_i  - y_i  \eps_{\text{tr}} \sign(\indof{\theta}{j^\star(\theta)}) e_{j^\star(\theta)})
\end{equation}
where $j^\star(\theta) := \argmax_j |\theta_j|$ is the index of the maximum absolute value of $\theta$.
We now prove by contradiction that the robust max-margin solution for
this perturbation set~\eqref{eq:l1maxpert} is equivalent to the solution~\eqref{eq:appmaxmargin} for the perturbation set~\eqref{eq:linfmaxpert}.
We start by assuming that $\thetahat{\epstrain}$ does not solve
Equation~\eqref{eq:appmaxmargin}, which is equivalent to assuming $1\not \in
j^\star(\thetahat{\epstrain})$ by definition. We now show how this assumption leads
to a contradiction.

Define the shorthand $\maxind := j^\star(\thetahat{\epstrain}) -1$. Since
$\thetahat{\epstrain}$ is the solution of~\eqref{eq:maxmarginl1}, by definition, we
have that $\thetahat{\epstrain}$ is also the max-margin solution of the shifted
dataset $D_{\epstrain} :=(x_i - y_i \eps_{\text{tr}} \sign(\thetaind{\maxind+1})
e_{\maxind+1}, y_i)$.  Further, note that by the assumption that $1
\not \in j^\star(\thetahat{\epstrain})$, this dataset $D_{\epstrain}$ consists of input
vectors $x_i = (y_i \frac{r}{2}, \tilde{x}_i - y_i \eps_{\text{tr}}
\sign(\thetaind{\maxind+1}) e_{\maxind+1} )$.  Hence via
Lemma~\ref{lem:maxmargin}, $\thetahat{\epstrain}$ can be written as
\begin{equation}
  \label{eq:sml}
       \thetahat{\epstrain} = \frac{1}{\sqrt{r^2 + 4 (\tilde{\gamma}^{\eps_{\text{tr}}})^2}} [r, 2 \tilde{\gamma}^{\eps_{\text{tr}}} \tilde{\theta}^{\eps_{\text{tr}}}],
\end{equation}
where $\tilde{\theta}^{\eps_{\text{tr}}}$ is the normalized max-margin solution
of  $\widetilde{D} := (\tilde{x}_i
- y_i \eps_{\text{tr}} \sign(\indof{\tilde{\theta}}{\maxind}) e_{\maxind},
y_i)$.

We now characterize $\tilde{\theta}^{\eps_{\text{tr}}}$. Note that by
assumption, $\maxind = j^\star(\tilde{\theta}^{\eps_{\text{tr}}}) = \argmax_j
|\indof{\tilde{\theta}^{\eps_{\text{tr}}}}{j}|$. Hence, the normalized max-margin
solution $\tilde{\theta}^{\eps_{\text{tr}}}$ is the solution of
\begin{equation}
  \label{eq:maxmarginsmall}
  \tilde{\theta}^{\eps_{\text{tr}}} := \argmax_{\|\tilde{\theta}\|_2 \leq 1}
  \min_{i\in [n]} y_i \tilde{\theta}^\top \tilde{x}_i - \eps_{\text{tr}}
  |\indof{\tilde{\theta}}{\maxind}| 
\end{equation}
Observe that the minimum margin of this estimator,
$\tilde{\gamma}^{\eps_{\text{tr}}}=\min_{i\in [n]} y_i
(\tilde{\theta}^{\eps_{\text{tr}}})^\top \tilde{x}_i - \eps_{\text{tr}}
|\indof{\tilde{\theta}^{\eps_{\text{tr}}}}{\maxind}|$, decreases with
$\eps_{\text{tr}}$ as the problem becomes harder, so that $\tilde{\gamma}^{\eps_{\text{tr}}}
\leq \tilde{\gamma}$, where the latter is the margin of
$\tilde{\theta}^{\eps_{\text{tr}}}$ for $\eps_{\text{tr}} = 0$.  Since $r >
2\tilde{\gamma}_{\max}$ by assumption in the theorem, by Lemma~\ref{lem:boundsmaxmargin}
 with probability at least $1-2\text{e}^{-\frac{\alpha^2 (d-1)}{2}}$, we then have that $r> 2\tilde{\gamma} \geq 2\tilde{\gamma}^{\eps_{\text{tr}}}$. Given
the closed form of $\thetahat{\epstrain}$ in Equation~\eqref{eq:sml}, it
directly follows that, up to the common normalization constant, $\indof{\thetahat{\epstrain}}{1} = r >
2\tilde{\gamma}^{\eps_{\text{tr}}} \|\tilde{\theta}^{\eps_{\text{tr}}}\|_2 =
\|\indof{\thetahat{\epstrain}}{2:d}\|_2$ and hence $1\in j^\star(\thetahat{\epstrain})$. This
contradicts the original assumption $1\not \in j^\star(\thetahat{\epstrain})$ and
hence we have established that $\thetahat{\eps_{\text{tr}}}$ for the
$\ell_1$-perturbation set~\eqref{eq:l1maxpert} has the same closed
form~\eqref{eq:appmaxmargin} as for the perturbation
set~\eqref{eq:linfmaxpert}.

The final statement is proved by using the analogous steps as in
the proof of 1. and 2. to obtain the closed form of the robust accuracy of
$\thetahat{\eps_{\text{tr}}}$.

\subsection{Proof of Lemma~\ref{lem:maxmargin}}
\label{sec:maxmarginproof}


We start by proving that $\thetahat{}$ is of the form
\begin{equation}
\label{Eq:max_margin_param_form_total_D}
\thetahat{} = \left[a_1, a_2 \tilde{\theta} \right],
\end{equation}
for $a_1, a_2 > 0$. Denote by $\decplanegen{\theta}$ the plane through the origin with normal $\theta$. We define $d\left((x,y), \decplanegen{\theta} \right)$ as the signed Euclidean distance from the point $(x,y) \in D \sim \mathbb{P}_{r}$ to the plane $\decplanegen{\theta}$. The signed Euclidean distance is defined as the Euclidean distance from $x$ to the plane if the point $(x,y)$ is correctly predicted by $\theta$, and the negative Euclidean distance from $x$ to the plane otherwise. We rewrite the definition of the max $\ell_2$-margin classifier. It is the classifier induced by the normalized vector $\thetahat{}$, such that 
\begin{equation*}
\max_{\theta \in \mathbb{R}^{d}} \min_{(x,y) \in D}d\left( \left(x,y\right),\decplanegen{\theta}\right)  = \min_{(x,y) \in D} d\left( \left(x,y \right),\decplanegen{\thetahat{}}\right).
\end{equation*}
We use that $D$ is deterministic in its first coordinate and get
\begin{equation*}
\begin{split}
	\max_{\theta}\min_{(x,y) \in D}d\left(\left(x,y\right), \decplanegen{\theta} \right) &= \max_{\theta}\min_{(x,y) \in D} y (\thetaind{1} \xind{1} + \tilde{\theta}^{\top} \tilde{x})\\
	&= \max_{\theta}  \theta_1  \frac{r}{2} + \min_{(x,y) \in D}  y \tilde{\theta}^{\top} \Tilde{x}.
	\end{split}
\end{equation*}
Because $r >0$, the maximum over all $\theta$ has $\thetahatind{}{1} \geq 0$. Take any $a > 0$ such that $\|\tilde{\theta}\|_2 = a$.  By definition, the max $\ell_2$-margin classifier $\tilde{\theta}$ maximizes $\min_{(x,y) \in D} d\left(\left(x,y\right), \decplanegen{\theta} \right)$. Therefore, $\thetahat{}$ is of the form of Equation \eqref{Eq:max_margin_param_form_total_D}. 

Note that all classifiers induced by vectors of the form of Equation \eqref{Eq:max_margin_param_form_total_D} classify $D$ correctly.  Next, we aim to find expressions for $a_1$ and $a_2$ such that Equation \eqref{Eq:max_margin_param_form_total_D} is the normalized max $\ell_2$-margin classifier. The distance from any $x \in D$ to $\decplanegen{\thetahat{}}$ is
\begin{equation*}
d\left(x,\decplanegen{\thetahat{}} \right) = \left| a_1 \xind{1}  + a_2 \tilde{\theta}^{\top} \tilde{x} \right|.
\end{equation*}
Using that $\xind{1} = y \frac{r}{2}$ and that the second term equals $a_2 d\left(x, \decplanegen{\tilde{\theta}}\right)$, we get
\begin{equation}
\label{eq:distance_to_opt_intermidate}
d\left(x,  \decplanegen{\thetahat{}}\right) =  \left| a_1 \frac{r}{2}  + a_2 d\left(x, \decplanegen{\tilde{\theta}}\right) \right| = a_1 \frac{r}{2}  + \sqrt{1-a_1^2} d\left(x, \decplanegen{\tilde{\theta}}\right).
\end{equation}
Let $(\tilde{x},y) \in \widetilde{D}$ be the point closest in Euclidean
distance to $\decplanegen{\tilde{\theta}}$, so that $d\left(x, \decplanegen{\tilde{\theta}}\right) = \tilde{\gamma}$. This point is also the closest point in
Euclidean distance to $\decplanegen{\thetahat{}}$, because by Equation
\eqref{eq:distance_to_opt_intermidate} $d\left(x,
\decplanegen{\thetahat{}}\right)$ is strictly decreasing for
decreasing $d\left(x, \decplanegen{\tilde{\theta}}\right)$. We maximize
the minimum margin $d\left(x, \decplanegen{\thetahat{}} \right)$ with
respect to $a_1$. Define the vectors $a = \left[a_1,
  a_2\right]$ and $v = \left[\frac{r}{2}, d\left(x,
  \decplanegen{\tilde{\theta}}\right)\right]$. We find using the dual
norm that
\begin{equation*}
a = \frac{v}{\|v\|_2}.
\end{equation*}
Plugging the expression of $a$ into Equation
\eqref{Eq:max_margin_param_form_total_D} yields that $\thetahat{}$ is
given by
\begin{equation*}
	\thetahat{} = \frac{1}{\sqrt{r^2 + 4 \tilde{\gamma}^2}}\left[r,  2 \tilde{\gamma}\tilde{\theta} \right].
\end{equation*}
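The dual-norm step can be spelled out as follows. This is a short sketch, assuming that the closest point attains $d\left(x, \decplanegen{\tilde{\theta}}\right) = \tilde{\gamma}$, so that $v = \left[\frac{r}{2}, \tilde{\gamma}\right]$, and using the constraint $\|a\|_2 = 1$ implied by $a_2 = \sqrt{1-a_1^2}$. By the Cauchy-Schwarz inequality,
\begin{equation*}
\max_{\|a\|_2 = 1} a^{\top} v = \|v\|_2 = \sqrt{\frac{r^2}{4} + \tilde{\gamma}^2} = \frac{1}{2}\sqrt{r^2 + 4\tilde{\gamma}^2},
\qquad
a = \frac{v}{\|v\|_2} = \frac{1}{\sqrt{r^2 + 4\tilde{\gamma}^2}}\left[r, 2\tilde{\gamma}\right],
\end{equation*}
which is consistent with the normalization constant in the expression for $\thetahat{}$ above.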

For the second part of the lemma we first decompose
\begin{equation}
  \label{eq:jacob}
\mathbb{P}_{r_{\text{test}}} \left[ Y\thetahat{\top} X >0 \right] = \frac{1}{2}\mathbb{P}_{r_{\text{test}}} \left[ \thetahat{\top} X >0 \mid Y=1 \right]  +\frac{1}{2}\mathbb{P}_{r_{\text{test}}} \left[\thetahat{\top} X <0 \mid Y=-1\right].
\end{equation}
We can further write 
\begin{align}
  \label{eq:cumul1}
\mathbb{P}_{r_{\text{test}}} \left[\thetahat{\top} X > 0 \mid
  Y = 1\right] &=\mathbb{P}_{r_{\text{test}}} \left[\sum_{i=2}^{d}\indof{\thetahat{}}{i} \indof{X}{i} > -
  \indof{\thetahat{}}{1} \: \indof{X}{1} \mid Y=1\right]\\
&= \mathbb{P}_{r_{\text{test}}} \left[2 \tilde{\gamma} \sum_{i=1}^{d-1}\indof{\tilde{\theta}}{i} \indof{X}{i} > -
  r \: \frac{r_{\text{test}}}{2} \mid Y=1\right]\nonumber\\
&= 1-\Phi\left(-\frac{r\: r_{\text{test}}}{4\sigma \tilde{\gamma}} \right) =
\Phi\left(\frac{r \: r_{\text{test}}}{4\sigma \tilde{\gamma}} \right) \nonumber
\end{align}
where $\Phi$ is the cumulative distribution function of the standard normal distribution. The second equality
follows by multiplying both sides by the normalization constant and the
third equality is due to the fact that $\sum_{i=1}^{d-1}\indof{\tilde{\theta}}{i} \indof{X}{i}$ is
a zero-mean Gaussian with variance $\sigma^2\|\tilde{\theta}\|^2_2 = \sigma^2$ since $\tilde{\theta}$ is normalized.
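
To make the third equality explicit, the following worked step may help. Here $Z$ is a symbol introduced only for this illustration and denotes $\sum_{i=1}^{d-1}\indof{\tilde{\theta}}{i} \indof{X}{i}$ conditioned on $Y = 1$, so that $Z/\sigma$ is standard normal; we also assume $\tilde{\gamma} > 0$:
\begin{equation*}
\mathbb{P}\left[2 \tilde{\gamma} Z > - \frac{r \: r_{\text{test}}}{2}\right]
= \mathbb{P}\left[\frac{Z}{\sigma} > - \frac{r \: r_{\text{test}}}{4 \sigma \tilde{\gamma}}\right]
= 1-\Phi\left(-\frac{r\: r_{\text{test}}}{4\sigma \tilde{\gamma}} \right)
= \Phi\left(\frac{r \: r_{\text{test}}}{4\sigma \tilde{\gamma}} \right),
\end{equation*}
where the last step uses the symmetry $\Phi(-z) = 1 - \Phi(z)$.
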
Correspondingly we can write
\begin{align}
  \label{eq:cumul2}
\mathbb{P}_{r_{\text{test}}} \left[\thetahat{\top} X < 0 \mid
  Y = -1\right] &=\mathbb{P}_{r_{\text{test}}} \left[2\tilde{\gamma}
  \sum_{i=1}^{d-1}\indof{\tilde{\theta}}{i} \indof{X}{i} < -
  r \left(- \frac{r_{\text{test}}}{2}\right) \mid Y=-1\right] = \Phi\left(\frac{r \:r_{\text{test}}}{4\sigma \tilde{\gamma}}\right) 
\end{align}
so that we can
combine~\eqref{eq:jacob}, \eqref{eq:cumul1}, and \eqref{eq:cumul2} to obtain
$\mathbb{P}_{r_{\text{test}}} (Y\thetahat{\top} X >0 ) = \Phi \left(\frac{r \:r_{\text{test}}}{4\sigma \tilde{\gamma}}\right)$. This concludes the proof of the lemma.









\subsection{Proof of Lemma \ref{lem:boundsmaxmargin}}
\label{sec:boundsmaxmargin}

The proof plan is as follows. We start from the definition of the max
$\ell_2$-margin of a dataset. Then, we rewrite the
max $\ell_2$-margin as an expression that includes a random matrix with independent
standard normal entries. This allows us to prove the upper and lower bounds for the
max $\ell_2$-margin in Sections~\ref{sec:gammaupperbound} and~\ref{sec:gammalowerbound},
respectively, using non-asymptotic estimates on the singular values of
Gaussian random matrices.

Given the dataset $\widetilde{D} =  \{(\tilde{x}_i, y_i)\}_{i=1}^{n}$, we define the random matrix
\begin{equation}
\label{eq:randmatrixsamples}
X = \begin{pmatrix}
\tilde{x}_1^{\top}\\
\tilde{x}_2^{\top}\\
\vdots\\
\tilde{x}_{n}^{\top}
\end{pmatrix},
\end{equation}
where $\tilde{x}_i \sim \mathcal{N}(0,\sigma^2 I_{d-1})$. 
Let $\mathcal{V}$ be the class of all perfect predictors of $\widetilde{D}$. For a matrix $A$ and vector $b$ we also denote by $|Ab|$ the vector whose entries correspond to the absolute values of the entries of $Ab$. 
Then, by definition
\begin{equation}
\label{maxmargindefgammaproof}
\tilde{\gamma} = \max_{v \in \mathcal{V}, \|v\|_2=1} \min_{j \in [n]} \indof{|X v|}{j} = \max_{v \in \mathcal{V}, \|v\|_2=1} \min_{j \in [n]} \sigma \indof{|Q v|}{j},
\end{equation}
where $Q = \frac{1}{\sigma} X$ is the scaled data matrix.

In the sequel we will use the operator norm of a matrix $A \in \mathbb{R}^{n \times (d-1)}$,
\begin{equation*}
\| A\|_2 = \sup_{v \in \mathbb{R}^{d-1} \colon \|v\|_2=1} \|A v \|_2,
\end{equation*}
and denote the maximum singular value of a matrix $A$ as $s_{\text{max}} (A)$ and the minimum singular value as $s_{\text{min}}(A)$.

\subsubsection{Upper bound}
\label{sec:gammaupperbound}


Since $\|Q v\|_2 \leq \|Q\|_2$ for every unit vector $v$ by definition of the
operator norm, and since the minimum entry of the vector
$|Q v|$ is at most $\frac{\|Q v\|_2}{\sqrt{n}}$, we can upper bound
$\tilde{\gamma}$ by
\begin{equation*}
\tilde{\gamma}  \leq \sigma  \frac{1}{\sqrt{n}} \|Q{}\|_2.
\end{equation*}
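The bound on the minimum entry used above follows because the minimum of $n$ nonnegative numbers is at most their root mean square. Spelled out for any unit vector $v$:
\begin{equation*}
\min_{j \in [n]} \indof{|Q v|}{j}
\leq \sqrt{\frac{1}{n} \sum_{j=1}^{n} \left(\indof{|Q v|}{j}\right)^2}
= \frac{\|Q v\|_2}{\sqrt{n}}
\leq \frac{\|Q\|_2}{\sqrt{n}}.
\end{equation*}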
Since $\|Q\|_2 =
s_{\text{max}}\left(Q\right)$ and $Q$ has independent standard normal entries,
it follows from
Corollary 5.35  of \cite{vershynin12}
that for all $t\geq 0$:
\begin{equation*}
\mathbb{P} \left[\sqrt{d-1}+\sqrt{n}+t \geq s_{\text{max}}\left(Q\right) \right] \geq 1-2e^{-\frac{t^2}{2}}.
\end{equation*}
Therefore, with probability at least $1-2e^{-\frac{t^2}{2}}$,
\begin{equation*}
\tilde{\gamma} \leq  \sigma \left(1+ \frac{t+\sqrt{d-1}}{\sqrt{n}}\right).
\end{equation*}
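For completeness, the last display is obtained by substituting the singular value bound into the operator norm bound, on the event where the former holds:
\begin{equation*}
\tilde{\gamma} \leq \frac{\sigma}{\sqrt{n}} s_{\text{max}}\left(Q\right)
\leq \frac{\sigma}{\sqrt{n}} \left(\sqrt{d-1}+\sqrt{n}+t\right)
= \sigma \left(1+ \frac{t+\sqrt{d-1}}{\sqrt{n}}\right).
\end{equation*}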

\subsubsection{Lower bound}
\label{sec:gammalowerbound}
By the definition in Equation \eqref{maxmargindefgammaproof}, if we
find a vector $v \in \mathcal{V}$ with $\|v\|_2=1$ such that for an
$a>0$, it holds that $\min_{j \in [n]} \sigma
\indof{|Q v|}{j} > a$, then $\tilde{\gamma} > a$.


Recall the definition of the max $\ell_2$-margin
in Equation \eqref{maxmargindefgammaproof}.
As $n < d-1$, the random matrix $Q$ is a wide
matrix, i.e. there are more columns than rows, and therefore its
minimal singular value is $0$.
Furthermore, $Q$ has rank $n$ almost surely and hence 
for all $c >0$, there exists a $v \in \mathbb{R}^{d-1}$ such that
\begin{equation}
\label{eq:existencerhs}
 \sigma Q v= 1_{\{ n\}}c> 0,
\end{equation}
where $ 1_{\{ n \}}$ denotes the all ones vector of dimension $n$. The smallest non-zero singular value of $Q$, $s_{\text{min, nonzero}}(Q)$, equals the smallest non-zero singular value of its transpose $Q^{\top}$. Therefore, there also exists a $v \in \mathcal{V}$ with $\|v\|_2=1$ such that
\begin{equation}
\label{minimum_step_gamma}
\tilde{\gamma} \geq  \min_{j \in [n]} \sigma \indof{|Q v|}{j} \geq \sigma s_{\text{min, nonzero}}\left(Q^{\top}\right)\frac{1}{\sqrt{n}},
\end{equation}
where we used the fact that any unit vector $v$ in the span of the right singular vectors of $Q$ with non-zero singular values satisfies $\|Q  v \|_2 \geq s_{\text{min, nonzero}}(Q)$, and the existence of a solution $v$ for any right-hand side as in Equation \eqref{eq:existencerhs}.
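
The factor $\frac{1}{\sqrt{n}}$ can be traced as follows. This is a sketch under the assumption that $v$ is the minimum-norm solution of Equation \eqref{eq:existencerhs}, rescaled to unit norm, so that it lies in the row space of $Q$ and all entries of $Q v$ are equal and positive. In that case the minimum entry coincides with the root mean square of the entries, and
\begin{equation*}
\min_{j \in [n]} \indof{|Q v|}{j} = \frac{\|Q v\|_2}{\sqrt{n}} \geq \frac{s_{\text{min, nonzero}}\left(Q\right)}{\sqrt{n}} = \frac{s_{\text{min, nonzero}}\left(Q^{\top}\right)}{\sqrt{n}}.
\end{equation*}
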
Applying
Corollary 5.35 of \cite{vershynin12} to $Q^{\top}$ yields that, for all $t\geq 0$, with probability at least $1-2e^{-\frac{t^2}{2}}$ we have
\begin{equation}
\tilde{\gamma} \geq \sigma\left( \frac{\sqrt{d-1}-t}{\sqrt{n}}-1\right).
\end{equation}





\section{Experimental details on the Waterbirds dataset}
\label{sec:waterbirds}
In this section, we discuss the experimental details and construction of the Waterbirds dataset in more detail. We also provide ablation studies of attack parameters such as the size of the motion blur kernel, plots of the robust error decomposition with increasing $n$, and some experiments using early stopping.

\paragraph{The Waterbirds dataset}

To build the Waterbirds dataset, we use the CUB-200 dataset \cite{Welinder10}, which contains images and labels of $200$ bird species, and $4$ background classes (forest, jungle/bamboo, water ocean, water lake natural) of the Places dataset \cite{zhou17}. The aim is to recognize whether the bird in a given image is a waterbird (e.g. an albatross) or a landbird (e.g. a woodpecker). To create the dataset, we randomly sample equally many waterbirds and landbirds from the CUB-200 dataset. Thereafter, we sample a random background image for each bird image. Then, we use the segmentation provided in the CUB-200 dataset to segment the birds from their original images and paste them onto the randomly sampled backgrounds. The resulting images have a size of $256 \times 256$. Moreover, we also resize the segmentations such that we have the correct segmentation profiles of the birds in the new dataset as well. For the concrete implementation, we use the code provided by \cite{Sagawa20}.

\paragraph{Experimental training details}
Following the example of \cite{Sagawa20}, we use a ResNet50 pretrained on the ImageNet dataset for all experiments, a weight-decay of $10^{-4}$, and train for $300$ epochs using the Adam optimizer. Extensive fine-tuning of the learning rate resulted in an optimal learning rate of $0.006$ for all experiments in the low sample size regime. Adversarial training is implemented as suggested in \cite{madry18}: at each iteration we find the worst case perturbation with an exact or approximate method. In all our experiments, the resulting classifier interpolates the training set. We plot the mean over all runs and the standard deviation of the mean. 

\paragraph{Specifics to the motion blur attack}
Fast moving objects or animals are hard to photograph due to motion blur. Hence, when trying to classify or detect moving objects from images, it is imperative that the classifier is robust against reasonable levels of motion blur. We implement the attack as follows. First, we segment the bird from the original image, then we apply a blur filter, and lastly we paste the blurred bird back onto the background. We can apply more severe blur by enlarging the kernel of the filter. See Figure \ref{fig:motion_blur_panel} for an ablation study of the kernel size. 

The motion blur filter is implemented as follows. We use a kernel of size $M \times M$ and fill the row $(M-1)/2$ of the kernel with the value $1/M$. Thereafter, we use the 2D convolution implementation of OpenCV (filter2D) \cite{opencv_library} to convolve the kernel with the image. Note that rotating the kernel before the convolution changes the direction of the resulting motion blur. Lastly, we find the most detrimental level of motion blur using a list-search over all levels up to $M_{max}$. 
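
As an illustration, consider $M = 5$. Assuming zero-based row indexing and that all remaining entries of the kernel are zero (neither is stated explicitly above), the kernel, which we denote by $k_5$ only for this example, would be
\begin{equation*}
k_5 = \frac{1}{5}
\begin{pmatrix}
0 & 0 & 0 & 0 & 0\\
0 & 0 & 0 & 0 & 0\\
1 & 1 & 1 & 1 & 1\\
0 & 0 & 0 & 0 & 0\\
0 & 0 & 0 & 0 & 0
\end{pmatrix},
\end{equation*}
so that each output pixel is the average of $5$ horizontally adjacent input pixels. Under this assumption the entries sum to one, and the filter therefore preserves the overall brightness of the image.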


\begin{figure*}[!t]
\centering
\begin{subfigure}[b]{0.19\textwidth}
  \includegraphics[width=0.99\linewidth]{plotsAistats/waterbird_original_example.png}
  \caption{Original}
  \label{fig:motion_blur_or}
\end{subfigure}
\begin{subfigure}[b]{0.19\textwidth}
  \includegraphics[width=0.99\linewidth]{plotsAistats/motion_blur_5.png}
  \caption{$M = 5$}
  \label{fig:motion_blur_5}
\end{subfigure}
\begin{subfigure}[b]{0.19\textwidth}
  \includegraphics[width=0.99\linewidth]{plotsAistats/motion_blur_10.png}
  \caption{$M = 10$}
  \label{fig:motion_blur_10}
\end{subfigure}
\begin{subfigure}[b]{0.19\textwidth}
  \includegraphics[width=0.99\linewidth]{plotsAistats/motion_blur_15.png}
  \caption{$M = 15$}
  \label{fig:motion_blur_15}
\end{subfigure}
\begin{subfigure}[b]{0.19\textwidth}
  \includegraphics[width=0.99\linewidth]{plotsAistats/motion_blur_20.png}
  \caption{$M = 20$}
  \label{fig:motion_blur_20}
\end{subfigure}
\caption{We perform an ablation study of the motion blur kernel size, which corresponds to the severity level of the blur. We see that for increasing $M$, the severity of the motion blur increases. In particular, note that for $M = 15$ and even $M = 20$, the bird remains recognizable: we do not semantically change the class, i.e. the perturbations are consistent.}
\label{fig:motion_blur_panel}
\end{figure*}

\begin{figure*}[!b]
\centering
\begin{subfigure}[b]{0.136\textwidth}
  \includegraphics[width=0.99\linewidth]{plotsAistats/bird_light_03_dark.png}
  \caption{$\epsilon = -0.3$}
  \label{fig:dark_03}
\end{subfigure}
\begin{subfigure}[b]{0.136\textwidth}
  \includegraphics[width=0.99\linewidth]{plotsAistats/bird_light_02_dark.png}
  \caption{$\epsilon = -0.2$}
  \label{fig:dark_02}
\end{subfigure}
\begin{subfigure}[b]{0.136\textwidth}
  \includegraphics[width=0.99\linewidth]{plotsAistats/bird_light_01_dark.png}
  \caption{$\epsilon = -0.1$}
  \label{fig:dark_01}
\end{subfigure}
\begin{subfigure}[b]{0.136\textwidth}
  \includegraphics[width=0.99\linewidth]{plotsAistats/waterbird_original_example.png}
  \caption{Original}
  \label{fig:light_or}
\end{subfigure}
\begin{subfigure}[b]{0.136\textwidth}
  \includegraphics[width=0.99\linewidth]{plotsAistats/bird_light_01_light.png}
  \caption{$\epsilon = 0.1$}
  \label{fig:light_01}
 \end{subfigure}
\begin{subfigure}[b]{0.136\textwidth}
  \includegraphics[width=0.99\linewidth]{plotsAistats/bird_light_02_light.png}
  \caption{$\epsilon = 0.2$}
  \label{fig:light_02}
\end{subfigure}
\begin{subfigure}[b]{0.136\textwidth}
  \includegraphics[width=0.99\linewidth]{plotsAistats/bird_light_03_light.png}
  \caption{$\epsilon = 0.3$}
  \label{fig:light_03}
\end{subfigure}
\caption{We perform an ablation study of the different lighting changes of the adversarial illumination attack. Even though the directed attack\xspace attacks the signal component in the image, the bird remains recognizable in all cases.}
\label{fig:light_panel}
\end{figure*}

\paragraph{Specifics to the adversarial illumination attack} 
An adversary can hide objects using poor lighting conditions, which can for example arise from shadows or bright spots. To model poor lighting conditions on the object only (or targeted to the object), we use the adversarial illumination attack. 
The attack is constructed as follows: First, we segment the bird from its background. Then we apply an additive constant $\epsilon$ to the bird, where the absolute size of the constant satisfies $|\epsilon| \leq \eps_{\text{te}} = 0.3$. Thereafter, we clip the values of the bird images to $[0, 1]$, and lastly, we paste the bird back onto the background. See Figure \ref{fig:light_panel} for an ablation of the parameter $\epsilon$ of the attack. Finding the worst perturbation, even approximately, is non-trivial. We find an approximate solution by searching over all perturbations with increments of size $\eps_{\text{te}}/K_{\text{max}}$. Denote by $\text{seg}$ the segmentation profile of the image $x$. We consider all perturbed images of the form
\begin{equation*}
x_{\text{pert}} = (1-\text{seg}) x + \text{seg} \left(x + \eps_{\text{te}} \frac{K}{K_{\text{max}}}  1_{256 \times 256}\right), \Hquad K \in \{-K_{\text{max}}, \dots, K_{\text{max}}\}.
\end{equation*} 
During training time we set $K_{\text{max}} = 16$ and therefore search over $33$ possible images. During test time we search over $65$ images ($K_{\text{max}} = 32$).
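
The number of candidate images and the resulting step sizes follow from a short calculation. The step sizes below assume a budget of $0.3$ at both training and test time; for other training budgets they scale accordingly:
\begin{align*}
\text{training:} \quad & 2 K_{\text{max}} + 1 = 2 \cdot 16 + 1 = 33, & \frac{0.3}{16} &= 0.01875, \\
\text{test:} \quad & 2 K_{\text{max}} + 1 = 2 \cdot 32 + 1 = 65, & \frac{0.3}{32} &= 0.009375.
\end{align*}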

\paragraph{Early stopping} In all our experiments on the Waterbirds dataset, a parameter search led to an optimal weight-decay and learning rate of $10^{-4}$ and $0.006$ respectively. Another common regularization technique is early stopping, where one stops training on the epoch where the classifier achieves minimal robust error on a hold-out dataset. To understand whether early stopping can mitigate the harm that adversarial training causes to robust generalization in comparison to standard training, we perform the following experiment. On the Waterbirds dataset of size $n = 20$ and considering the adversarial illumination attack, we compare standard training with early stopping and adversarial training $(\eps_{\text{tr}} = \eps_{\text{te}} = 0.3)$ with early stopping. Considering several independent experiments, early stopped adversarial training has an average robust error of $33.5$ and early stopped standard training of $29.1$. Hence, early stopping does decrease the robust error gap, but does not close it. 

\paragraph{Error decomposition with increasing $n$}

In Figure \ref{fig:waterbirds_light_numobs}, we see that adversarial training hurts robust generalization in the small sample size regime. For completeness, we plot the robust error decomposition for adversarial and standard training in Figure \ref{fig:light_numsamp_decomposition}. We see that in the low sample size regime, the drop in susceptibility that adversarial training achieves in comparison to standard training is much lower than the increase in standard error. Conversely, in the high sample regime, the drop in susceptibility from adversarial training over standard training is much bigger than the increase in standard error. 

\begin{figure*}[!t]
\centering
\begin{subfigure}[b]{0.32\textwidth}
  \includegraphics[width=0.99\linewidth]{plotsAistats/numsamp_waterbirds_light.png}
  \caption{Robust error}
  \label{fig:app_waterbirds_robust_error}
\end{subfigure}
\begin{subfigure}[b]{0.32\textwidth}
  \includegraphics[width=0.99\linewidth]{plotsAistats/waterbirds_standard_numsamp.png}
  \caption{Standard error}
  \label{fig:app_waterbirds_standard_error}
\end{subfigure}
\begin{subfigure}[b]{0.32\textwidth}
  \includegraphics[width=0.99\linewidth]{plotsAistats/waterbirds_susceptibility_decomposition.png}
  \caption{Susceptibility}
  \label{fig:app_waterbirds_susceptibility}
\end{subfigure}
\caption{We plot the robust error decomposition of the experiments depicted in Figure \ref{fig:waterbirds_light_numobs}. The plots depict the mean and standard deviation of the mean over several independent experiments. We see that, in comparison to standard training, the reduction in susceptibility for adversarial training is minimal in the low sample size regime. Moreover, the increase in standard error of adversarial training is quite severe, leading to an overall increase in robust error in the low sample size regime.}
\label{fig:light_numsamp_decomposition}
\end{figure*}
\section{Future work}


This paper aims to caution the practitioner against blindly following
current widespread practices to increase the robust
performance of machine learning models.
Specifically, adversarial training is currently recognized to be one
of the most effective defense mechanisms for $\ell_p$-perturbations,
significantly outperforming standard training in terms of robust performance.  However, we prove that
this common wisdom does not apply in the low-sample size regime to directed attacks, which are perceptible (albeit consistent) but efficiently focus their
attack budget to target ground truth class information.
In particular, in such settings adversarial training can in fact yield worse accuracy than standard training.






In terms of follow-up work on directed attacks in the low-sample
regime, there are some concrete questions that would be interesting to
explore.  For example, as discussed in Section~\ref{sec:relatedwork},
it would be useful to test whether some methods to mitigate the
standard accuracy vs. robustness trade-off would also relieve the
perils of adversarial training for directed attacks. Further, we
hypothesize that, independent of the attack during test time, it is
important in the small sample-size regime to choose perturbation sets
during training that align with
the ground truth signal (such as rotations for data with inherent
rotation). If this hypothesis were to be confirmed, it would break
with yet another general rule that the best defense perturbation type
should always match the attack during evaluation.  The insights from
this study might also be helpful in the context of searching for
good defense perturbations.




\section{Discussion and related work}





\section{Introduction}
\label{sec:intro}
\begin{wrapfigure}{r}{0.43\textwidth}
\centering
\vspace{-0.1in}
\includegraphics[width=0.99\linewidth]{plotsAistats/teaser_try_2.png}
\caption{
 
  On subsampled \mbox{CIFAR10} attacked by $2\times 2$ masks, adversarial training yields higher robust error than standard training
  when the sample size is small, even though it helps for large sample sizes
  (see Sec.~\ref{sec:app_cifar10} for details).}
 
 
  \vspace{-0.2in}
\label{fig:teaserplot}
\end{wrapfigure}

Today's best-performing classifiers are vulnerable to adversarial attacks
\cite{goodfellow15, szegedy14} and exhibit high \emph{robust error}: for many inputs, their predictions change under adversarial perturbations,
even though the true class stays the same. 
For example, in image classification tasks, we distinguish between two categories of
such attacks, perceptible and imperceptible perturbations, both of which are content-preserving \cite{gilmer18b} (or consistent \cite{raghunathan20}) if their strength is limited.
Most work to date studies imperceptible attacks such as 
bounded $\ell_p$-norm perturbations \cite{goodfellow15, madry18, moosavi16}, small transformations using image processing
techniques \cite{ghiasi19, zhao20, laidlaw21, Luo18} or 
nearby samples on the data manifold \cite{Lin20, Zhou20}.
They can often use their limited budget to successfully fool a learned classifier but, by definition, do not visibly reduce information about the actual class: the object in the perturbed image looks exactly the same as in the original version.

On the other hand, perceptible perturbations may occur more naturally in practice or are physically realizable. 
For example, stickers can be placed on traffic signs \cite{Eykholt18},
masks of different sizes may cover important features of human faces
\cite{Wu20}, images might be rotated or translated \cite{Logan19},
animals in motion may appear blurred in photographs
depending on the shutter speed, or the lighting conditions could be poor (see Figure~\ref{fig:sig_att_examples}).
Some perceptible attacks can effectively
use the perturbation budget to reduce actual class information
in the input (the \emph{signal}) while still preserving the original class.
For example, a stop sign with a small sticker does not lose its semantic meaning,
and a flying bird does not become a different species because it induces motion blur
in the image.
We refer to these attacks as \emph{directed attacks\xspace} (see
Section~\ref{sec:robustness} for a more formal
characterization). 

In this paper, we
demonstrate that one of the most common beliefs 
for adversarial attacks does not transfer to directed attacks\xspace, in
particular when the sample size is small. Specifically, it is widely acknowledged that adversarial training often achieves significantly lower adversarial error than standard
training. This holds in particular if
the perturbation type 
\cite{madry18, zhang19, Bai21} and perturbation budget match the attack during test time. 
Intuitively, the improvement is a result of decreased
\emph{attack-susceptibility}: independent of the true class,
adversarial training explicitly encourages the classifier to predict
the same class for all perturbed points.





























\begin{figure}[t]
\vskip 0.2in
\begin{center}
\begin{subfigure}[b]{0.2\textwidth}
  \includegraphics[width=0.99\linewidth]{plotsAistats/CIFAR10_class_boat.png}
  \caption{Masking}
  \label{fig:CIFAR10_boat}
\end{subfigure}
\begin{subfigure}[b]{0.2\textwidth}
  \includegraphics[width=0.99\linewidth]{plotsAistats/light_darker.png}
  \caption{Illumination}
  \label{fig:WB_light_dark}
\end{subfigure}
\begin{subfigure}[b]{0.2\textwidth}
  \includegraphics[width=0.99\linewidth]{plotsAistats/water-bird_motion_blurred.png}
  \caption{Motion blur}
  \label{fig:WB_motion_blur}
\end{subfigure}
\begin{subfigure}[b]{0.2\textwidth}
  \includegraphics[width=0.99\linewidth]{plotsAistats/waterbird_original_example.png}
  \caption{Original}
  \label{fig:fig:WB_original}
\end{subfigure}
\caption{Examples of directed attacks\xspace on CIFAR10 and the
  Waterbirds dataset. In Figure \ref{fig:CIFAR10_boat}, we corrupt the image with a black mask of size $2 \times 2$, and in Figures \ref{fig:WB_light_dark} and \ref{fig:WB_motion_blur} we change the lighting conditions (darkening) and apply motion blur on the bird in the image, respectively. 
 
  All perturbations effectively reduce the information about the class in the images: they are the result of directed attacks\xspace.}
\label{fig:sig_att_examples}
\end{center}
\vskip -0.2in
\end{figure}


In this paper, we question the efficacy of adversarial training to increase
robust accuracy for directed attacks\xspace.
In particular, we show that adversarial training not only increases standard test error (as noted in \cite{zhang19, tsipras19, Stutz19, raghunathan20}), but surprisingly,
\begin{center}
 \emph{adversarial training may even increase the robust test error compared to standard training!}
\end{center}
Figure \ref{fig:teaserplot} illustrates the main message of our paper for CIFAR10 subsets: Although adversarial training
outperforms standard training when enough training samples are available, it is inferior
in the low-sample regime.  
More specifically, our contributions are as follows:
\begin{itemize}
\item We prove that, almost surely, adversarially training a linear classifier on separable data yields a monotonically increasing robust error as the perturbation budget grows. 
We further establish high-probability non-asymptotic lower bounds on the robust error gap between adversarial and standard training.
 
\item Our proof provides intuition for why this phenomenon is particularly prominent for directed attacks\xspace in the small sample size regime.
 
\item We show that this phenomenon occurs on a variety of real-world datasets and perceptible directed attacks\xspace in the small sample size regime.
 
\end{itemize}






\section{Real-world experiments}
\label{sec:realworldexpapp}

In this section, we demonstrate that adversarial training may
hurt robust accuracy in a variety of image attack scenarios
on the Waterbirds and CIFAR10 datasets.
The corresponding experimental details and more experimental results (including
on an additional hand gestures dataset) can be found in Appendices
 \ref{sec:waterbirds}, \ref{sec:app_cifar10} and \ref{sec:handgestures}.


\subsection{Datasets}

We now describe the datasets and models that we use for the
experiments. In all our experiments on CIFAR10, we vary the sample
size by subsampling the dataset and use a ResNet18 \cite{He16} as
model. We always train on the same (randomly subsampled) dataset,
meaning that the variances arise from the random seed of the model and
the randomness in the training algorithm. In Appendix
\ref{sec:app_cifar10}, we complement the results of this section by
reporting the results of similar experiments with different
architectures.


As a second dataset, we build a new version of the Waterbirds
dataset, consisting of images of water- and
landbirds of size $256 \times 256$ and labels that distinguish the
two types of birds. We construct the dataset as follows: First, we
sample equally many water- and landbirds from the CUB-200 dataset
\cite{Welinder10}. Then, we segment the birds and paste them onto a
background that is randomly sampled (without replacement) from the Places-256 dataset \cite{zhou17}.
For the implementation of the dataset we used the code provided by \citet{Sagawa20}. Also, following the choice of \citet{Sagawa20}, we use as model a ResNet50 that was pretrained on ImageNet and which achieves near perfect standard accuracy.


\subsection{Evaluation of directed attacks\xspace}

We consider three types of directed attacks\xspace on our real world datasets:
square masks, motion blur and adversarial illumination. The mask
attack is a model used to simulate sticker-attacks and general
occlusions of objects in images \cite{Eykholt18, Wu20}. On the other
hand, motion blur may arise naturally for example when photographing
fast moving objects with a slow shutter speed. Further, adversarial
illumination may result from adversarial lighting conditions or smart
image corruptions. Next, we describe the attacks in more detail.

\paragraph{Mask attacks}
On CIFAR10, we consider the square black mask attack: the adversary can set a mask
of size $\eps_{\text{te}} \times \eps_{\text{te}}$ to zero in the image. To ensure that the mask does not cover the whole signal in the image, we
restrict the size of the masks to be at most $2 \times 2$. Hence, the search space of the attack consists of all possible locations of the masks in the targeted image. For exact robust error evaluation, we perform a full grid search over all possible locations during test time. See Figure \ref{fig:CIFAR10_boat} for an example of a mask attack on CIFAR10.

\paragraph{Motion blur}
On the Waterbirds dataset we consider two directed attacks\xspace: motion blur and adversarial illumination. For the motion blur attack,
the bird may move at different speeds without changing the background. 
The aim is to be robust against all motion blur severity levels up to $M_{max} = 15$. 
To simulate motion blur, we first segment the birds and then use a filter with a kernel of size $M$ to apply motion blur on the bird only. Lastly, we paste the blurred bird back onto the background image. We can change the severity level of the motion blur by increasing the kernel size of the filter.
See Appendix \ref{sec:waterbirds} for an ablation study and concrete expressions of the motion blur kernel. At test time, we perform a full grid search over all kernel sizes to exactly evaluate the robust error. We refer to Figure \ref{fig:WB_motion_blur} and Section \ref{sec:waterbirds} for examples of our motion blur attack.

\paragraph{Adversarial illumination} As a second attack on the Waterbirds dataset, we consider adversarial illumination. The adversary can darken or brighten the bird without corrupting the background of the image. The attack aims to model images where the object at interest is hidden in shadows or placed against bright light. 
To compute the adversarial illumination attack, we segment the bird, then darken or brighten it by adding a constant $a \in [-\eps_{\text{te}}, \eps_{\text{te}}]$, before pasting the bird back onto the background image. We find the most adversarial lighting level, i.e. the value of $a$, by equidistantly partitioning the interval $[-\eps_{\text{te}}, \eps_{\text{te}}]$ into $K$ steps and performing a full list-search over all steps.
See Figure \ref{fig:WB_light_dark} and Section \ref{sec:waterbirds} for examples of the adversarial illumination attack.


\subsection{Adversarial training procedure}

For all datasets, we run SGD until convergence on the \emph{robust} cross-entropy
loss~\eqref{eq:emploss}. In each iteration, we search for an adversarial example
and update the weights using a gradient with respect to the resulting
perturbed example \cite{goodfellow15, madry18}.
For every experiment, we choose the learning
rate and weight decay parameters that minimize the robust error on a
hold-out dataset. We now describe the implementation of the
adversarial search for the three types of
directed attacks\xspace. 

\paragraph{Mask attacks}
Unless specified otherwise, we use an approximate attack similar to
\citet{Wu20} during training time:
First, we identify promising mask locations by analyzing the gradient, $\nabla_x L(f_\theta(x), y)$, of the cross-entropy loss with respect to the input. Masks that cover part of the image where the gradient is large are more likely to increase the loss. Hence, we compute the $K$ mask locations $(i, j)$ where $\|\nabla_x L(f_\theta(x), y)_{[i:i+2, j:j+2]} \|_1$ is the largest and, using a full list-search, take the mask that incurs the highest loss.
Our intuition from the theory predicts that higher $K$,
and hence a more exact ``defense'', only increases the robust error of
adversarial training, since the mask could then more efficiently cover
important information about the class. We indeed confirm this effect
and provide more details in Section~\ref{sec:app_cifar10}.
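
The two-stage search described above can be summarized as follows. This is a sketch in notation introduced only for illustration: $s_{ij}$ denotes the gradient score of the window with upper left corner $(i,j)$, $\mathcal{C}_K$ the set of the $K$ highest-scoring locations, and $m_{ij}(x)$ the image $x$ with that $2 \times 2$ window set to zero.
\begin{align*}
s_{ij} &= \left\|\nabla_x L(f_\theta(x), y)_{[i:i+2, j:j+2]} \right\|_1, \\
\mathcal{C}_K &= \text{the $K$ locations $(i,j)$ with the largest $s_{ij}$}, \\
(i^{\star}, j^{\star}) &\in \arg\max_{(i,j) \in \mathcal{C}_K} L\left(f_\theta\left(m_{ij}(x)\right), y\right).
\end{align*}
The scores $s_{ij}$ come from a single gradient computation, whereas the final step evaluates the loss on $K$ masked images, which is what makes the approximation cheaper than a full grid search over all locations.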

\paragraph{Motion blur}
Intuitively the worst attack should be the most severe blur, rendering
a search over a range of severity superfluous.  However, similar to
rotations, this is not necessarily true in practice since the training loss on
neural networks is generally nonconvex. Hence, during training time,
we perform a search over kernels with sizes $2i$ for $i = 1,\dots,
\lfloor M_{max}/2 \rfloor$. Note that, at test time, we do an exact search
over all kernels of sizes in $\{1, 2, \dots, M_{max}\}$.
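
For concreteness, take $M_{max} = 15$ as stated above and assume that the upper index is rounded down when $M_{max}$ is odd. The two search sets would then be
\begin{equation*}
\text{training:} \quad \{2, 4, \dots, 14\} \quad (7 \text{ kernel sizes}), \qquad
\text{test:} \quad \{1, 2, \dots, 15\} \quad (15 \text{ kernel sizes}),
\end{equation*}
so that the training-time search evaluates roughly half as many blur levels as the exact test-time search.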

\paragraph{Adversarial illumination}
Similar to the motion blur attack, intuitively the worst perturbation
should be the most severe lighting changes; either darkening or
illuminating the object maximally. However, again this is not
necessarily the case, since finding the worst attack is a nonconvex
problem. Therefore, during training and testing we partition the
interval $[-\eps_{\text{tr}}, \eps_{\text{tr}}]$ into $33$ and $65$ steps,
respectively, and perform a full grid-search to find the worst
perturbation.

\subsection{Adversarial training can hurt robust generalization}

Further, we perform the following experiments on the Waterbirds dataset using the motion blur and adversarial illumination attack. We vary the adversarial training  budget $\eps_{\text{tr}}$, while keeping the number of samples fixed, and compute the resulting robust error.
We see in Figures \ref{fig:waterbirds_light_d_n} and \ref{fig:motion_lines} that, indeed, adversarial training can hurt robust generalization with increasing perturbation budget $\eps_{\text{tr}}$.

Furthermore, to gain intuition as described in Section~\ref{logreg_proof_sketch}, we also plot the robust error decomposition (Equation~\ref{eq:decomposition}) consisting of the standard error and susceptibility in Figures \ref{fig:light_trade_off} and \ref{fig:motion_blur_trade_off}. Recall that we measure susceptibility as the fraction of data points in the test set for
which the classifier predicts a different class under an adversarial attack.
As in our linear example, we observe an increase in robust error despite a slight drop
in susceptibility, because of the more severe increase in standard error. 
Similar experiments for the hand gesture dataset can be
found in Appendix~\ref{sec:handgestures}. 


\begin{figure*}[!t]
\centering
\begin{subfigure}[b]{0.4\textwidth}
  \includegraphics[width=0.99\linewidth]{plotsAistats/waterbirds_motion_d_n.png}
  \caption{Robust error with increasing $\eps_{\text{tr}}$}
  \label{fig:motion_lines}
\end{subfigure}
\begin{subfigure}[b]{0.4\textwidth}
  \includegraphics[width=0.99\linewidth]{plotsAistats/waterbirds_trade-off.png}
  \caption{Robust error decomposition}
  \label{fig:motion_blur_trade_off}
\end{subfigure}
  \caption{ (a) We plot the robust error with increasing adversarial training budget $\eps_{\text{tr}}$ of $5$ experiments on the subsampled Waterbirds datasets of sample sizes $20$ and $50$. Even though adversarial training hurts robust generalization for low sample size ($n = 20$), it helps for $n = 50$.  (b) We plot the decomposition of the robust error into standard error and susceptibility with increasing adversarial budget $\eps_{\text{tr}}$. We plot the mean and standard deviation of the mean of $5$ experiments on a subsampled Waterbirds dataset of size $n = 20$. The increase in standard error is more severe than the drop in susceptibility, leading to a slight increase in robust error. For more experimental details see Section \ref{sec:waterbirds}.}
\label{fig:motion_blur_real_world}
\end{figure*}



As predicted by our theorem, the phenomenon where adversarial training hurts robust generalization is most pronounced in the small sample size regime. Indeed, the experiments depicted in Figures \ref{fig:waterbirds_light_d_n} and \ref{fig:motion_lines} are conducted on small sample size datasets of $n = 20$ or $50$.
In Figures \ref{fig:teaserplot} and \ref{fig:waterbirds_light_numobs}, we
observe that as the sample size increases, adversarial training does improve robust generalization compared to standard training, even for directed attacks\xspace. Moreover, in the CIFAR10 experiments using the mask perturbation, which can be found in Figure \ref{fig:teaserplot} and Appendix \ref{sec:app_cifar10}, we observe the same behaviour: Adversarial training hurts robust generalization in the low sample size regime, but helps when enough samples are available. 


\subsection{Discussion}

In this section, we discuss how different algorithmic choices, motivated
by related work, affect when and how adversarial training hurts robust generalization. 

\paragraph{Strength of attack and catastrophic overfitting}
In many cases, the worst-case perturbation during adversarial training is found using an approximate algorithm such as projected gradient descent. It is a common belief that using the strongest attack (in the mask-perturbation case, full grid search) during training should also result in better robust generalization. 
In particular, the literature on catastrophic overfitting shows that weaker attacks during training lead to bad performance on stronger attacks during testing  \cite{Wong20Fast, andriushchenko20, li21}.
Our result suggests the opposite is true in the low-sample size regime for
directed attacks\xspace: the weaker the attack, the better
adversarial training performs.






  


\paragraph{Robust overfitting}
Recent work observes empirically \cite{rice20} and theoretically
\cite{sanyal20, donhauser21}, that perfectly minimizing the
adversarial loss during training might in fact be suboptimal for
robust generalization; that is, classical regularization techniques
might lead to higher robust accuracy. The phenomenon is often referred
to as robust overfitting. Can the phenomenon be mitigated using
standard regularization techniques?  In Appendix \ref{sec:waterbirds} we shed light on
this question and show that adversarial training hurts robust generalization even when standard regularization methods such as early stopping are used.


\section{Related work}
\label{sec:relatedwork}

We now discuss how our results relate to phenomena that have been observed or proven in the literature before.

\paragraph{Robust and non-robust useful features}
In the words of \citet{ilyas19, springer21}, for
directed attacks, all robust features become less useful, but adversarial
training uses robust features more.  In the small sample-size regime
$n<d-1$ in particular, robust learning assigns so much weight
on the robust (possibly non-useful) features, that the signal in the non-robust
features is drowned. This leads to an unavoidable and large increase
in standard error that dominates the decrease in susceptibility and
hence ultimately leads to an increase of the robust error.

\paragraph{Small sample size and robustness}
A direct consequence of Theorem~\ref{thm:linlinf} is that in order to
achieve the same robust error as standard training, adversarial
training requires more samples. This statement might remind the reader
of sample complexity results for robust generalization in
\citet{schmidt18, Yin19, Khim18}. While those results compare sample
complexity bounds for standard vs. robust error, our theorem
statement compares two algorithms, standard vs. adversarial training,
with respect to the robust error.


\paragraph{Trade-off between standard and robust error} 

Many papers observed that even though adversarial training decreases robust error compared to standard training, it may lead
to an increase in standard test error \cite{madry18, zhang19}.  
For example, \citet{tsipras19, zhang19, javanmard20, dobriban20, chen20} study settings where the Bayes optimal robust classifier is not equal to the Bayes optimal (standard)
classifier (i.e. the perturbations are inconsistent or the dataset is non-separable).
\citet{raghunathan20} study consistent perturbations, as in our paper,
and prove that for small sample size, fitting adversarial
examples can increase standard error even in the absence of
noise. In contrast to the aforementioned works, which do not refute that
adversarial training decreases robust error, we prove that for
directed attacks\xspace in the small sample regime, adversarial training may also increase \emph{robust error}.

\paragraph{Mitigation of the trade-off} 
A long line of work has proposed procedures to 
mitigate the trade-off phenomenon.  For example \citet{alayrac19,
  Carmon19, zhai20, raghunathan20} study robust self training, which
leverages a large set of unlabelled data, while \citet{lee20, lamb19,
  xu20} use data augmentation by interpolation. \citet{Ding20,
  balaji19, Cheng20} on the other hand propose to use adaptive
perturbation budgets $\eps_{\text{tr}}$ that vary across inputs. 
Our intuition from the theoretical analysis suggests that the standard
mitigation procedures for imperceptible perturbations may not work for
perceptible directed attacks\xspace, because all relevant features are non-robust.
We leave a thorough empirical study
as interesting future work.












\section{Robust classification}
\label{sec:robustness}



 


We first introduce our robust classification setting more formally by defining
the notions of  adversarial robustness, directed attacks\xspace and adversarial training
used throughout the paper.


\paragraph{Adversarially robust classifiers}

For inputs $x \in \mathbb{R}^d$, we consider multi-class classifiers
associated with parameterized functions $f_\theta:\mathbb{R}^d \to
\mathbb{R}^K$, where $K$ is the number of labels. In the special case of binary classification ($K = 2$), we use the output predictions $y=\textrm{sign}(f_\theta(x))$. For example, $f_\theta(x)$ could be linear models (as in Section~\ref{sec:theoryresults}) or
neural networks (as in Section~\ref{sec:realworldexpapp}).

One key step to encourage deployment of machine learning based classification in real-world applications is to increase the
robustness of classifiers against perturbations that do not
change the ground truth label. 
Mathematically speaking, we would like to have a small
\emph{$\eps_{\text{te}}$-robust error}, defined as
\begin{equation}
  \label{eq:roberr}
  \roberr{\theta} := \mathbb{E}_{(x, y)\sim \mathbb{P}} \max_{x' \in \pertset{x}{\eps_{\text{te}}}} \ell(f_\theta (x'),y),
\end{equation}
where $\ell$ is the multi-class zero-one loss, which equals $1$ if and only if the label predicted
from $f_\theta(x')$ does not match the true label $y$.
Further, $\pertset{x}{\eps_{\text{te}}}$ is a perturbation set associated with a \emph{transformation type} and size $\eps_{\text{te}}$. 
Note that the \emph{(standard) error} of a classifier corresponds to evaluating $\roberr{\theta}$ at $\eps_{\text{te}} = 0$, yielding the standard error $\stderr{\theta} =\mathbb{E}_{(x, y)\sim \mathbb{P}} \ell(f_\theta (x),y)$.

\paragraph{(Signal)-Directed attacks}
Most works in the existing literature consider consistent perturbations where
$\eps_{\text{te}}$ is small enough such that all samples in the perturbation set
have the same ground truth or expert label. 
Note that the ground truth model $f_{\theta^{\star}}$ is therefore robust against perturbations and achieves the same error for standard and adversarial evaluation. 
The inner maximization in Equation~\eqref{eq:roberr} is often called the adversarial \emph{attack} of the model $f_\theta$ and the corresponding solution is referred to as the adversarial example.
In this paper, we consider \emph{directed attacks\xspace}, as described in Section~\ref{sec:intro}, that effectively reduce the information about the ground truth classes.
Formally, we characterize \emph{directed attacks\xspace} by the following property: 
for any model $f_\theta$ with low standard error, the corresponding adversarial example is well-aligned with the adversarial example found using the ground truth model 
$f_{\theta^{\star}}$.
Examples of such attacks are additive perturbations that are constrained to the direction of the ground truth decision boundary.  We provide concrete examples for linear classification in  Section~\ref{logreg_linear_model}.
 





















 

















\paragraph{Adversarial training}



In order to obtain classifiers with a good robust accuracy, it is
common practice to minimize a (robust) training objective $\mathcal{L}_{\eps_{\text{tr}}}$ with a surrogate
classification loss $L$ such as
\begin{equation}
  \label{eq:emploss}
  \robloss{\theta} :=  \frac{1}{n} \sum_{i=1}^n \max_{x_i' \in \pertset{x_i}{\eps_{\text{tr}}}} L(f_\theta(x_i') y_i),
\end{equation}
which is called adversarial training.  In practice, we often use the
cross entropy loss $L(z) = \log (1+ \text{e}^{-z})$ and minimize the
robust objective by using first order optimization methods such as
(stochastic) gradient descent.  SGD is also the algorithm that we
focus on in both the theoretical and experimental sections.


When the desired type of robustness is known in advance, it is
standard practice to use the same perturbation set for training as for
testing, i.e. $\pertset{x}{\eps_{\text{tr}}}=\pertset{x}{\eps_{\text{te}}}$. For example, \citet{madry18} show that the robust error sharply increases for $\eps_{\text{tr}} < \eps_{\text{te}}$.
In this paper, we show that for directed attacks\xspace in the small sample size regime, in fact, the opposite is true.



\section{Theoretical results}
\label{sec:theoryresults}
In this section, we prove for linear functions $f_\theta(x) =
\theta^\top x$ that  in the case of directed attacks, robust
generalization deteriorates with increasing $\eps_{\text{tr}}$.
The proof, albeit in a simple setting, provides
explanations for why adversarial training fails in the
high-dimensional regime for such attacks.
 
 


\subsection{Setting}
\label{logreg_linear_model}

We now introduce the precise linear setting used in our theoretical results.




\paragraph{Data model}
In this section, we assume that the ground truth and hypothesis class
are given by linear functions $f_\theta(x) = \theta^\top x$ and the
sample size $n$ is lower than the ambient dimension $d$.  In
particular, the generative distribution $\mathbb{P}_r$ is similar to
\cite{tsipras19, kolter19}: The label $y \in \{+1, -1\}$ is drawn with
equal probability and the covariate vector is sampled as $x =
[y\frac{r}{2}, \tilde{x}]$ with the random vector $\tilde{x} \in
\mathbb{R}^{d-1}$ drawn from an isotropic Gaussian distribution,
i.e. $\tilde{x} \sim \mathcal{N}(0, \sigma^2 I_{d-1})$. We would like to
learn a classifier that has low robust error by using a dataset
$D = \{(x_i, y_i)\}_{i=1}^n$ with $n$ i.i.d. samples from
$\mathbb{P}_{r}$.

Notice that the distribution $\mathbb{P}_{r}$ is noiseless: for a given input
$x$, the label $y = \sign(\xind{1})$ is deterministic. Further, the
optimal linear classifier (also referred to as the \emph{ground
  truth}) is parameterized by $\theta^{\star} = e_1$.\footnote{Note that the result more generally holds for non-sparse models that are not axis aligned by way of a simple rotation $z = U x$. In that case the distribution is characterized by $\theta^\star = u_1$ and a rotated Gaussian in the $d-1$ dimensions orthogonal to $\theta^\star$.} By definition, the ground truth is
robust against all consistent perturbations and hence the optimal
robust classifier.





\paragraph{Directed attacks\xspace}  
The focus in this paper lies on consistent directed attacks\xspace that by
definition efficiently concentrate their attack budget in the
direction of the signal.  For our linear setting, we can model such
attacks by  additive perturbations in the first dimension
\begin{equation}
  \label{eq:linfmaxpert}
  \pertset{x}{\epsilon} = \{x'=x+\delta  \mid \delta = \beta e_1 \text{ and } -\epsilon \leq \beta\leq \epsilon\}.
\end{equation}
Note that this attack is always in the direction of the true signal dimension, i.e. the ground truth. Furthermore, when  $\epsilon < \frac{r}{2}$, it is a consistent directed attack\xspace.
Observe how this differs from $\ell_p$ attacks: depending on the model, an $\ell_p$ attack may add a perturbation that only has a very small component in the signal direction. 
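
For a linear classifier, the worst-case perturbation within the set in Equation~\eqref{eq:linfmaxpert} can be computed by hand; the short calculation below is a sketch added for illustration. For a sample $(x, y)$ and $\delta = \beta e_1$ with $|\beta| \leq \epsilon$, the margin reads
\begin{equation*}
  y \theta^\top (x + \beta e_1) = y \theta^\top x + y \beta \thetaind{1},
\end{equation*}
which is minimized at $\beta = - y \epsilon \sign(\thetaind{1})$, so that
\begin{equation*}
  \min_{x' \in \pertset{x}{\epsilon}} y \theta^\top x' = y \theta^\top x - \epsilon |\thetaind{1}|.
\end{equation*}
In particular, whenever $\thetaind{1} > 0$, the worst-case perturbation is $\delta = -y \epsilon e_1$: it moves the sample towards the ground truth decision boundary, irrespective of the remaining entries of $\theta$.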




\paragraph{Robust max-$\ell_2$-margin classifier}

A long line of work studies the implicit bias of interpolators
that result from applying stochastic gradient descent on the logistic loss until convergence \cite{liu20, Ji19, Chizat20, nacson19}.
For linear models, we obtain the $\eps_{\text{tr}}$-robust maximum-$\ell_2$-margin solution (\emph{robust max-margin} in short) 
\begin{equation}
  \label{eq:maxmargin}
  \thetahat{\eps_{\text{tr}}} := \argmax_{\|\theta\|_2\leq 1} \min_{i\in [n], x_i' \in \pertset{x_i}{\eps_{\text{tr}}}} y_i \theta^\top x_i'.
\end{equation}
This can for example be shown by a simple rescaling
argument using Theorem 3.4 in \cite{liu20}.  Even though our result is proven for the max-$\ell_2$-margin classifier,
it can easily be extended to other interpolators.



\begin{figure*}[!t]
  \centering
\begin{subfigure}[b]{0.3\textwidth}
  \includegraphics[width=0.99\linewidth]{plotsAistats/gap_lower_final_main_theorem.png}
  \caption{Robust error increase with $\eps_{\text{tr}}$}
  \label{fig:main_lower_bound_eps}
\end{subfigure}
\begin{subfigure}[b]{0.3\textwidth}
  \includegraphics[width=0.99\linewidth]{plotsAistats/robust_error_ST_AT.png}
  \caption{Standard-adversarial training}
  \label{fig:main_numobs}
\end{subfigure}
\begin{subfigure}[b]{0.3\textwidth}
  \includegraphics[width=0.99\linewidth]{plotsAistats/gap_final_good_colours.png}
  \caption{Effect of over-parameterization}
  \label{fig:main_numobs_bound}
\end{subfigure}
\caption{Experimental verification of Theorem \ref{thm:linlinf}.
(a) We set $d = 1000$, $r = 12$, $n = 50$ and plot the robust error gap between standard and adversarial training with increasing adversarial budget $\eps_{\text{tr}}$ of $5$ independent experiments. For comparison, we also plot the lower bound given in Theorem \ref{thm:linlinf}. In (b) and (c), we set $d = 10000$ and vary the number of samples $n$. (b) We plot the robust error of standard and adversarial training ($\eps_{\text{tr}} = 4.5$). (c) We compute the error gap and the lower bound of Theorem \ref{thm:linlinf}. For more experimental details see Appendix~\ref{sec:logregapp}.}
  \vspace{-0.2in}
\label{fig:main_theorem}
\end{figure*}

\subsection{Main results}
\label{logreg_main_theorem}


We are now ready to characterize the
$\eps_{\text{te}}$-robust error as a function of $\eps_{\text{tr}}$, the separation
$r$, the dimension $d$ and sample size $n$ of the
data. In the theorem statement we use the following quantities
\begin{align*}
      \varphi_{\text{min}} &= \frac{\sigma}{r/2-\eps_{\text{te}}}  \left(  \sqrt{\frac{d-1}{n}} - \left(1 + \sqrt{\frac{2 \log (2/\delta)}{n}}\right)\right)\\
      \varphi_{\text{max}} &= \frac{\sigma}{r/2-\eps_{\text{te}}}  \left(  \sqrt{\frac{d-1}{n}} + \left(1 + \sqrt{\frac{2 \log (2/\delta)}{n}}\right)\right)
\end{align*}
that arise from concentration bounds for the singular values of the random data matrix. Further, let $\tilde{\epsilon} := \frac{r}{2} - \frac{\varphi_{\text{max}}}{\sqrt{2}}$ and denote by
 $\Phi$ the cumulative distribution function of a standard normal.
\begin{theorem}
  \label{thm:linlinf}
 
  Assume $d-1>n$. 
  For any $\eps_{\text{te}} \geq 0$ with $2 \eps_{\text{te}} < r$, consider the $\eps_{\text{te}}$-robust error on test samples from $\mathbb{P}_{r}$ with perturbation sets as in Equations~\eqref{eq:linfmaxpert} and~\eqref{eq:l1maxpert}. Then the following holds:
  \begin{enumerate}
  \item
   
   
      The $\eps_{\text{te}}$-robust error of the $\eps_{\text{tr}}$-robust max-margin estimator reads
    \begin{equation}
      \roberr{\thetahat{\eps_{\text{tr}}}} = \Phi \left( -\frac{\left( \frac{r}{2}-\eps_{\text{tr}} \right) }{\tilde{\varphi}} \right)
    \end{equation}
   
   
   
    for a random quantity $\tilde{\varphi}>0$ depending on $\sigma, r,\eps_{\text{te}}$, which is a strictly increasing function with respect to $\eps_{\text{tr}}$.
   
   
   
   
   
   
     
 
 
 
 
 
 

  \item
   
    With probability at least $1-\delta$, we further have $\varphi_{\text{min}} \leq \tilde{\varphi}\leq \varphi_{\text{max}}$ and  the following lower bound on the robust error increase by adversarially training with size $\eps_{\text{tr}}$
    \begin{equation}
      \roberr{\thetahat{\eps_{\text{tr}}}} - \roberr{\thetahat{0}}
      \geq 
     
      \Phi \left(\frac{r/2}{\varphi_{\text{min}}} \right) - \Phi \left(  \frac{r/2 -\min\{\eps_{\text{tr}}, \tilde{\epsilon}\}}{ \varphi_{\text{min}}} \right).
   
    \end{equation}
   
   
   
   
   
   
   
    
   

   
   
   
   
   
   
   
   
   
   
   
   
   
   
 
 
 
 
 
 
 
 
 
 
 
 
 
  \end{enumerate}
\end{theorem}





 


The proof can be found in Appendix~\ref{sec:app_theorylinear} and
primarily relies on high-dimensional probability. Note that the
theorem holds for any $0\leq \eps_{\text{te}} <\frac{r}{2}$ and hence
also directly applies to the standard error by setting $\eps_{\text{te}} =
0$. In Figure~\ref{fig:main_theorem}, we empirically confirm the statements of Theorem \ref{thm:linlinf} by performing multiple experiments on synthetic datasets as described in Subsection \ref{logreg_linear_model} with different choices of $d/n$ and $\eps_{\text{tr}}$. 
In the first statement, we prove that for small
sample-size ($n<d-1$) noiseless data,
almost surely, the robust error increases monotonically with
the adversarial training budget $\eps_{\text{tr}} >0$.
In Figure~\ref{fig:main_lower_bound_eps}, we plot the robust error gap between standard and adversarial logistic regression as a function of the adversarial training budget $\eps_{\text{tr}}$ for $5$ runs. 

The second statement establishes a simplified lower bound on the
robust error increase for adversarial training (for a fixed
$\eps_{\text{tr}} = \eps_{\text{te}}$)  compared to standard training.
In Figures~\ref{fig:main_lower_bound_eps}  and \ref{fig:main_numobs_bound}, we show how the lower bound closely
predicts the robust error gap in our synthetic experiments.
Furthermore, by the dependence of $\varphi_{\text{min}}$ on the overparameterization ratio $d/n$, the lower bound on the robust error gap is amplified for large $d/n$.
Indeed, Figure~\ref{fig:main_numobs_bound} shows how the error gap increases with $d/n$
both theoretically and experimentally. However, when $d/n$ increases above a certain threshold, the gap decreases again, as standard training fails to learn the signal and yields a high error (see Figure~\ref{fig:main_numobs}).













\begin{figure*}[!t]
\centering
\begin{subfigure}[b]{0.32\textwidth}
  \includegraphics[width=0.99\linewidth]{plotsAistats/d_n_logreg.png}
  \caption{Robust error vs $\eps_{\text{tr}}$}
  \label{fig:eps_logreg}
\end{subfigure}
\begin{subfigure}[b]{0.32\textwidth}
  \includegraphics[width=0.99\linewidth]{plotsAistats/logreg_trade_off_plot.png}
  \caption{Robust error decomposition}
  \label{fig:main_robust}
\end{subfigure}
\begin{subfigure}[b]{0.32\textwidth}
  \includegraphics[width=0.99\linewidth]{plotsAistats/linear_intuition_try.png}
  \caption{Intuition in 2D}
  \label{fig:2D_dataset_intuition}
\end{subfigure}
\caption{(a) We set $d=1000$ and $r = 12$ and plot the robust error with increasing adversarial training budget ($\eps_{\text{tr}}$) and with increasing $d/n$.  (b) We plot the robust error decomposition into susceptibility and standard error for increasing adversarial budget $\eps_{\text{tr}}$. 
  Full experimental details can be found in Section~\ref{sec:logregapp}. (c) 2D illustration providing intuition for the linear setting: Training on directed attacks\xspace (yellow) effectively corresponds to fitting the original datapoints (blue) after shifting them closer to the decision boundary. The robust max-$\ell_2$-margin (yellow dotted) is heavily tilted if the points are far apart in the non-signal dimension, while the standard max-$\ell_2$-margin solution (blue dashed) is much closer to the ground truth (gray solid). }
\label{fig:lineartradeoff}
\end{figure*}

\subsection{Proof idea: intuition and surprises}
\label{logreg_proof_sketch}

The reason that adversarial training hurts
robust generalization is based on an extreme robust vs. standard
error tradeoff. We provide intuition for the effect of
directed attacks\xspace and the small sample regime
on the solution of adversarial training by decomposing the
robust error $\roberr{\theta}$.
Notice that the $\eps_{\text{te}}$-robust error $\roberr{\theta}$ 
can be written as the probability of the union of two events: the
event that the classifier based on $\theta$ is wrong and the event
that the classifier is susceptible to attacks:
\begin{equation}
 \label{eq:decomposition}
\begin{aligned}
     \roberr{\theta} &=  \mathbb{E}_{x, y\sim \mathbb{P}}  \left[\Indi{y f_\theta (x) <0} \vee \max_{x' \in \pertset{x}{\eps_{\text{te}}}} \Indi{f_\theta(x) f_\theta(x')<0} \right] \\
  &\leq \stderr{\theta} + \suscept{\theta}
\end{aligned}
\end{equation}
where $\suscept{\theta}$ is the expectation of the maximization term in Equation \eqref{eq:decomposition}.
$\suscept{\theta}$ represents the $\eps_{\text{te}}$-\emph{attack-susceptibility} of a classifier
induced by $\theta$ and $\stderr{\theta}$ its standard error.
Equation~\eqref{eq:decomposition} suggests that
the robust error can only be small if both the standard error and
susceptibility are small. In Figure~\ref{fig:main_robust}, we
plot the decomposition of the robust error in standard error and susceptibility for adversarial logistic regression with increasing $\eps_{\text{tr}}$. We observe that increasing $\eps_{\text{tr}}$
increases the standard error too drastically compared to the decrease
in susceptibility, leading to an effective drop in robust accuracy. For completeness, in Appendix \ref{app:susc}, we provide upper and lower bounds for the susceptibility score.  We
now explain why, in the small-sample size regime, adversarial training
with directed attacks\xspace ~\eqref{eq:linfmaxpert} may increase standard
error to the extent that it dominates the decrease in susceptibility.
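
The inequality in Equation~\eqref{eq:decomposition}, together with a matching lower bound, follows from elementary properties of the union of two events; we spell out the short argument here as a sketch, where the event names $A$ and $B$ are introduced only for this purpose. Let $A = \{y f_\theta(x) < 0\}$ be the event that the classifier errs on the unperturbed input and $B = \{\exists\, x' \in \pertset{x}{\eps_{\text{te}}} : f_\theta(x) f_\theta(x') < 0\}$ the event that an admissible perturbation flips the prediction, so that $\stderr{\theta} = \mathbb{P}(A)$ and $\suscept{\theta} = \mathbb{P}(B)$. Then
\begin{equation*}
  \max\{\stderr{\theta}, \suscept{\theta}\} \leq \mathbb{P}(A \cup B) = \roberr{\theta} \leq \stderr{\theta} + \suscept{\theta}.
\end{equation*}
The lower bound makes precise why a small robust error requires both terms to be small, while the upper bound shows that the robust error can only be large if at least one of the two terms is large.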








A key observation is that the robust max-$\ell_2$-margin solution of a
dataset $D= \{(x_i, y_i)\}_{i=1}^n$ 
maximizes the minimum margin that reads ${\min_{i\in [n]}
  y_i \theta^\top (x_i - y_i \eps_{\text{tr}} |\thetaind{1}| e_1)}$, where
$\indof{\theta}{i}$ refers to the $i$-th entry of vector $\theta$. Therefore, it
simply corresponds to the max $\ell_2$-margin solution of the dataset
shifted towards the decision boundary ${D_{\epstrain} = \{(x_i - y_i \eps_{\text{tr}}
  |\indof{\thetahat{\eps_{\text{tr}}}}{1}| e_1, y_i)\}_{i=1}^n}$.
Using this fact, we obtain 
a closed-form expression of the (normalized) max-margin solution~\eqref{eq:maxmargin} as a function of
$\eps_{\text{tr}}$ that reads
\begin{equation}
  \label{eq:maxmarginmaintext}
\thetahat{\eps_{\text{tr}}} = \frac{1}{\sqrt{(r-2\eps_{\text{tr}})^2 + 4 \tilde{\gamma}^2}}
\left[r - 2\eps_{\text{tr}}, 2 \tilde{\gamma} \tilde{\theta} \right],
\end{equation} 
where $\|\tilde{\theta}\|_2 = 1$ and $\tilde{\gamma} >0$ is a random quantity
associated with the max-$\ell_2$-margin solution of the
$d-1$ dimensional Gaussian inputs orthogonal
to the signal direction
(see Lemma~\ref{lem:maxmargin} in Section~\ref{sec:app_theorylinear}).

In high dimensions, with high probability any two
Gaussian random vectors are far apart -- in our
distributional setting, this corresponds to the vectors being far
apart in the non-signal directions. In
Figure~\ref{fig:2D_dataset_intuition}, we illustrate the phenomenon
using a simplified 2D cartoon, where the few samples
in the dataset are all far apart in the non-signal direction.
We see how shifting the dataset closer to the true decision boundary
may result in a max-margin solution (yellow) that aligns much worse
with the ground truth (gray), compared to the estimator learned from
the original points (blue). Even though the new (robust max-margin)
classifier (yellow) is less susceptible to directed attacks in the
signal dimension, it also uses the signal dimension less.
Mathematically, this is directly
reflected in the expression of the max-margin solution in
Equation~\eqref{eq:maxmarginmaintext}: Even without the definition of
$\tilde{\gamma}, \tilde{\theta}$, we can directly see that the first
(signal) dimension is used less as $\eps_{\text{tr}}$ increases.
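
To connect this expression to Theorem~\ref{thm:linlinf}, the following sketch evaluates the robust error of a generic linear classifier under the data model. It is added for illustration: the notation $\theta_{2:d}$ for the last $d-1$ entries of $\theta$ is introduced here, and the identification of $\tilde{\varphi}$ at the end is our reading of the theorem rather than its formal definition, which is given in the appendix. Let $\theta = [\thetaind{1}, \theta_{2:d}]$ with $\thetaind{1} > 0$ and $\theta_{2:d} \neq 0$. For a test sample $x = [y\frac{r}{2}, \tilde{x}]$, the worst-case margin over the perturbation set~\eqref{eq:linfmaxpert} of size $\eps_{\text{te}}$ equals $\left(\frac{r}{2} - \eps_{\text{te}}\right) \thetaind{1} + y\, \theta_{2:d}^\top \tilde{x}$, where $y\, \theta_{2:d}^\top \tilde{x} \sim \mathcal{N}(0, \sigma^2 \|\theta_{2:d}\|_2^2)$. Hence
\begin{equation*}
  \roberr{\theta} = \Phi \left( - \frac{\left(\frac{r}{2} - \eps_{\text{te}}\right) \thetaind{1}}{\sigma \|\theta_{2:d}\|_2} \right).
\end{equation*}
For the solution in Equation~\eqref{eq:maxmarginmaintext}, the ratio of the signal entry to the norm of the remaining entries is $(r - 2\eps_{\text{tr}})/(2\tilde{\gamma})$, independently of the normalization constant, which gives
\begin{equation*}
  \roberr{\thetahat{\eps_{\text{tr}}}} = \Phi \left( - \frac{\left(\frac{r}{2} - \eps_{\text{te}}\right) \left(\frac{r}{2} - \eps_{\text{tr}}\right)}{\sigma \tilde{\gamma}} \right).
\end{equation*}
This has the form stated in Theorem~\ref{thm:linlinf} if one sets $\tilde{\varphi} = \sigma \tilde{\gamma} / \left(\frac{r}{2} - \eps_{\text{te}}\right)$, and for a fixed value of $\tilde{\gamma}$ it increases with $\eps_{\text{tr}}$ as long as $\eps_{\text{te}} < \frac{r}{2}$.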


















\subsection{Generality of the results}

In this section we discuss how the theorem might generalize to
other perturbation sets, models and training procedures.
\paragraph{Signal direction is known}
The type of additive perturbations used in Theorem~\ref{thm:linlinf},
defined in Equation~\eqref{eq:linfmaxpert}, is explicitly constrained
to the direction of the true signal. This choice is reminiscent of
corruptions where every possible perturbation in the set is directly
targeted at the object to be recognized, such as motion blur of moving
objects.  Such corruptions are also studied in the context of domain
generalization and adaptation \cite{Schneider20}.

Directed attacks\xspace in general, however, may also consist of
perturbation sets that are only strongly biased towards the true
signal direction, such as mask attacks.  They may find the true signal
direction only when the inner maximization is
exact. The following corollary extends Theorem~\ref{thm:linlinf} to
small $\ell_1$-perturbations
\begin{equation}
  \label{eq:l1maxpert}
  \pertset{x}{\epsilon} = \{x'=x+\delta \mid \|\delta\|_1 \leq \epsilon\},
\end{equation}
for $0<\epsilon<\frac{r}{2}$ that reflect such attacks. We state the corollary here and give the proof in Appendix \ref{sec:app_theorylinear}.
\begin{corollary}
\label{cor:l1extension}
  Theorem~\ref{thm:linlinf} also holds for the estimator~\eqref{eq:maxmargin} with perturbation sets defined in~\eqref{eq:l1maxpert}.
\end{corollary}
The proof uses the fact that the inner maximization
effectively results in a sparse perturbation equivalent to the attack
resulting from the perturbation set~\eqref{eq:linfmaxpert}. 




\paragraph{Other models}
Motivated by the implicit bias results of (stochastic)
gradient descent on the logistic loss, Theorem~\ref{thm:linlinf} is proven for the max-$\ell_2$-margin
solution. We conjecture
that for the data distribution in Section \ref{sec:theoryresults},
adversarial training can hurt robust generalization also for other models with zero
training error (\emph{interpolators} in short).

For example, Adaboost is a widely used algorithm that converges to the max-$\ell_1$-margin classifier \cite{telgarsky13}. One might argue that for a sparse ground truth, the max-$\ell_1$-margin classifier should (at least in the noiseless case) have the right inductive bias to alleviate large bias in high dimensions. Hence, in many cases the (sparse) max-$\ell_1$-margin solution might align with the ground
truth for a given dataset. However, we conjecture that even in this
case, the \emph{robust} max-$\ell_1$-margin solution (of the dataset
shifted towards the decision boundary) would be misled to choose a
wrong sparse solution. This can be seen with the help of the cartoon
illustration in Figure \ref{fig:2D_dataset_intuition}.






