\documentclass[accepted]{uai2022}

\usepackage{microtype}
\usepackage{graphicx}
\usepackage{booktabs}
\usepackage{amsmath}
\usepackage{amssymb}
\usepackage[colorinlistoftodos,textsize=scriptsize]{todonotes}
\usepackage{marginnote}
\usepackage{amsthm}
\usepackage{multirow}
\usepackage{float}
\usepackage{subcaption}
\usepackage{wrapfig}
\graphicspath{{./Figures/}}

% \usepackage{zref-xr,zref-user}
\usepackage{nameref}
\usepackage{zref-xr}
\zxrsetup{toltxlabel}
% \zexternaldocument*{appendix}
\zexternaldocument*{maini_322-supp}
% \externaldocument{../maini_322-supp}

\linepenalty=1000


\newcommand{\theHalgorithm}{\arabic{algorithm}}
\newtheorem{theorem}{Theorem}

\def\ours{\textsc{Protector}}
\def\re{\text{ReColor}}
\def\st{\text{StAdv}}
\def\mp{M_\mathcal{A}}


\definecolor{darkgreen}{rgb}{0,0.3,0}
\definecolor{darkblue}{rgb}{0,0,0.5}
\definecolor{darkorange}{rgb}{0.9,0.4,0}
\newcommand{\eat}[1]{}

\usepackage[utf8]{inputenc} %
\usepackage[T1]{fontenc}    %
\usepackage{url}            %
\usepackage{amsfonts}       %
\usepackage{nicefrac}       %
\usepackage{xcolor}         %
\usepackage[american]{babel}
\usepackage{natbib} %
    \bibliographystyle{plainnat}
    \renewcommand{\bibsection}{\subsubsection*{References}}


\DeclareMathOperator*{\argmax}{arg\,max}
\DeclareMathOperator*{\argmin}{arg\,min}

\usepackage{mathtools} %
\usepackage{tikz} %


\title{Perturbation Type Categorization for Multiple Adversarial Perturbation Robustness}

\author[1]{\href{mailto:<pratyushmaini@cmu.edu>?Subject=Your UAI 2022 paper}{Pratyush Maini}{}}
\author[2]{Xinyun Chen}
\author[3]{Bo Li}
\author[2]{Dawn Song}

\affil[1]{%
    Carnegie Mellon University
}
\affil[2]{%
    University of California, Berkeley
}
\affil[3]{%
    University of Illinois at Urbana-Champaign
  }
  
  \begin{document}
\maketitle

\begin{abstract}
    Recent works in adversarial robustness have proposed defenses to improve the robustness of a single model against the union of multiple perturbation types. However, 
    these methods still suffer significant trade-offs compared to the ones specifically trained to be robust against a single perturbation type. 
    In this work, we introduce the problem of categorizing adversarial examples based on their perturbation types. 
    We first theoretically show on a toy task that adversarial examples of different perturbation types constitute different distributions---making it possible to distinguish them. We support these arguments with experimental validation on multiple $\ell_p$ attacks and common corruptions. 
    Instead of training a single classifier, we propose \ours{}, a two-stage pipeline that first categorizes the perturbation type of the input, and then makes the final prediction using the classifier specifically trained against the predicted perturbation type. 
    We theoretically show that at test time the adversary faces a natural trade-off between fooling the perturbation classifier and the succeeding classifier optimized with perturbation-specific adversarial training. 
    This makes it challenging for an adversary to plant strong attacks against the whole pipeline.
    Experiments on MNIST and CIFAR-10 show that \ours{} outperforms prior adversarial training-based defenses by over 5\% when tested against the union of $\ell_1, \ell_2, \ell_\infty$ attacks. 
    Additionally, our method extends to a more diverse attack suite, also showing large robustness gains against multiple $\ell_p$, spatial and recolor attacks.    
\end{abstract}

\section{Introduction}
\label{section:intro}

Machine learning models have been shown to be vulnerable to different types of adversarial examples---inputs with a small magnitude of perturbation added to mislead the classifier's prediction~\citep{szegedy2013intriguing}. %
Consequently, many defenses have been proposed to improve their robustness, a majority of which focus on achieving robustness against a specific perturbation type \citep{goodfellow2014explaining,aleks2017deep,kurakin2017adversarial,tramer2018ensemble,Dong_2018_CVPR,zhang2019theoretically,carmon2019unlabeled}. 
However, as ML models get adopted in real-world applications, it becomes important for the defenses to be robust against different 
types of perturbations given the flexibility of practical attackers.
In addition, prior work showed that when models are trained to be robust against one perturbation~type,~the~robustness is typically not preserved against attacks of a different type~\citep{schott2018towards, kang2019testing}. 

Motivated by the need for robustness against diverse perturbation types, recent works have attempted to train models that are robust against multiple perturbation types~\citep{tramer2019adversarial, maini2019adversarial,laidlaw2021perceptual}. These works consider perturbations restricted by their $\ell_p$ norms ($p \in \{1,2,\infty\}$) or spatial and color transformations. The proposed methods improve the overall robustness against multiple perturbation types. However, when evaluating the robustness against each individual perturbation type, the robustness of models trained by these methods is still considerably worse than that of models trained on a single perturbation type.
Given these empirical observations, in this work we aim to answer: \textit{Are different types of perturbations separable? Can we categorize them to improve robustness to multiple adversarial perturbations?}


To address these questions and explore the properties of different perturbation types, we introduce the problem of \textit{categorizing adversarial examples} based on their perturbation types. 
We present theoretical analysis on a toy task to show that 
when we add different types of perturbations to benign samples of a given ground-truth class, their new distributions are distinct and separable.
We experimentally validate our theoretical results on both (mathematically) well-defined perturbation regions such as $\ell_p$ balls, as well as various common corruptions~\citep{hendrycks2019benchmarking}. We find that deep networks are able to categorize different perturbation types with high accuracy ($>95\%$). Further, our perturbation classifier shows high generalization accuracy ($ \sim 90\%$) to \textit{unseen} common corruptions, i.e., correctly predicting their categories (weather, noise, blur, or digital) without training on them.
While in this work we focus on improving worst-case adversarial robustness, applications of categorizing perturbation types extend beyond it---such as 
detecting \emph{systematic} distribution shifts (e.g. presence of snow for self-driving cars~\citep{michaelis2020benchmarking}). Further, using a perturbation classifier as the discriminator may improve the effectiveness and variety of adversarial examples produced by generative models~\citep{wong2021learning,xiao2018generating,song2018constructing}.

Based on our theoretical analysis, we propose \emph{\ours}, a two-stage pipeline that performs~\emph{Perturbation Type Categorization to Improve Robustness} against multiple perturbations. First, the top-level perturbation classifier predicts the perturbation type of the input. Then, among the second-level predictors, {\ours} selects the one that is the most robust to the predicted perturbation type to make the final prediction.  We theoretically show that there exists a natural tension between attacking the perturbation classifier and the second-level predictors. Specifically, strong attacks against the second-level predictors make it easier for the perturbation classifier to predict the adversarial perturbation type; on the other hand, fooling the perturbation classifier requires planting weaker (or less representative) attacks against the second-level predictors. As a result, even an \textit{imperfect} perturbation classifier significantly improves the model's overall robustness to multiple~perturbation~types. We also supplement our theoretical statements on the toy task with experimental validation in the exact same setting.

Empirically\footnote{Code for reproducing our experiments can be found at \href{https://github.com/sunblaze-ucb/adversarial-protector}{https://github.com/sunblaze-ucb/adversarial-protector}.}, we first show that the perturbation classifier generalizes well on classifying a wide range of adversarial perturbations. 
Then we compare {\ours} with recent defenses against multiple attack types on MNIST and CIFAR-10. 
Even though we do not utilize adversarial training~\citep{goodfellow2014explaining} to train the perturbation classifier, an ensemble of diverse perturbation classifiers, together with small noise added to the inputs, helps make \ours{} robust against adaptive attacks. Specifically, we combine predictions of perturbation classifiers that classify adversarial examples in their image and Fourier domains~\citep{yin2019fourier}. 
This further increases the tension between attacking top-level and second-level components by reducing the space of successful adversarial attacks. 
{\ours} outperforms prior approaches by over 5\% against the union of 
$\ell_1, \ell_2$ and $\ell_\infty$ attacks. 
Over the suite of 15 different attacks tested, the average improvement w.r.t. the state-of-the-art baseline defense is $\sim15\%$ on both MNIST and CIFAR-10. 
Training a model to be robust against multiple attacks typically imposes a significant tradeoff against the accuracy on benign samples, but \ours{} attains $\sim7\%$ greater benign test accuracy on CIFAR-10 as compared to recent works~\citep{laidlaw2021perceptual,maini2019adversarial}. 
We further demonstrate how our defense naturally extends beyond $\ell_p$ perturbation types, where we assess the robustness of our model against the union of $\ell_\infty$, $\ell_2$, spatial~\citep{wong_wasserstein_2019,xiao2018spatially} and recolor~\citep{Bhattad2020Unrestricted,laidlaw2019functional} attacks on CIFAR-10. Our defense exceeds the robustness of recent work~\citep{laidlaw2021perceptual} by over 13\% against all attacks.
In addition, {\ours} provides the flexibility to plug in and integrate new defenses against individual perturbation types into the existing framework as second-level predictors, thus the defense performance of {\ours} can be continuously improved with the development of more advanced defenses against single perturbation types.



\section{Related Work}

\textbf{Adversarial examples.} 
Among the different types of adversarial attacks studied in prior work~\citep{szegedy2013intriguing,goodfellow2014explaining,aleks2017deep,hendrycks2019natural,Bhattad2020Unrestricted}, the majority constrain the perturbation within a small $\ell_p$ region around the original input. To improve model robustness in the presence of such adversaries, most existing defenses utilize adversarial training~\citep{goodfellow2014explaining}, which augments the training dataset with adversarial examples. To date, different variants of adversarial training algorithms remain the most successful defenses against adversarial attacks~\citep{carmon2019unlabeled,zhang2019theoretically,wong2020fast,rice2020overfitting,wang2020onceforall}. Other types of defenses include input transformation~\citep{guo2018countering,Buckman2018ThermometerEO} and network distillation~\citep{papernot2016distillation}, but these were rendered ineffective under stronger adversaries~\citep{he2017adversarial, carlini2017adversarial, athalye2018obfuscated, tramer2020adaptive}. 


\textbf{Defenses against multiple perturbation types.} 
Some recent works have focused on defending against a union of norm-bounded $\ell_p$ attacks. 
\citet{schott2018towards, kang2019testing} showed that models that were trained for a given $\ell_p$-norm bounded attack are not robust against attacks in a different $\ell_q$ region. 
\citet{schott2018towards} proposed the use of multiple variational autoencoders to achieve robustness to multiple $\ell_p$ attacks on MNIST. 
\citet{tramer2019adversarial} used simple aggregations of multiple adversaries to achieve non-trivial robust accuracy against $\ell_1, \ell_2, \ell_\infty$ attacks. \citet{maini2019adversarial} proposed MSD that takes gradient steps in the union of multiple $\ell_p$ regions to improve multiple perturbation robustness. 
Most recently, \citet{laidlaw2021perceptual} proposed a defense against unseen perturbations using perceptual adversarial training. They evaluate their work against $\ell_\infty$, $\ell_2$, spatial, and recolor adversaries. 

\textbf{Detection of adversarial examples.} Multiple prior works have focused on detecting adversarial examples~\citep{feinman2017detecting, lee2018simple, ma2018characterizing, cennamo2020statistical, fidel2019explainability,yin2019adversarial}. However, most of these methods were rendered ineffective in the presence of adaptive adversaries \citep{carlini2017adversarial, tramer2020adaptive}. 
In comparison, our work focuses on a more challenging problem of categorizing perturbation types. 
To this end, \citet{yin2019fourier} proposed the examination of Fourier transforms of adversarial examples to determine the adversarial attack and corruption types. 




\begin{figure*}[ht]
    \centering
    \begin{subfigure}[t]{0.35\linewidth}
      \includegraphics[width=\linewidth]{figures/Pipeline.pdf}
      \caption{}
      \label{fig:pipeline-a}
    \end{subfigure}\hspace{15mm}
    \begin{subfigure}[t]{0.35\linewidth}
       \includegraphics[width=\linewidth]{figures/Tradeoff.pdf}
       \caption{}
      \label{fig:tradeoff}
    \end{subfigure}
    \caption{An overview of {\ours}. (a) The perturbation classifier $C_{adv}$ categorizes representative attacks of different types. (b) An illustration of the trade-off in Theorem~\ref{thm:trade-off}. An adversarial example fooling $C_{adv}$ (the $\ell_\infty$ sample marked in red) becomes weaker to attack the second-level $\mp{}$ models. Stronger or more representative attacks (marked green) are correctly categorized.}
     \label{fig:pipeline}
    \end{figure*}
    
    \section{Separability of Perturbation Types}
    \label{section:multiple-perturb}
    In this section, we formally illustrate the setup of perturbation categorization. 
    In Theorem~\ref{thm:separability}, we show the existence of a classifier that can separate adversarial examples belonging to different perturbation types. We focus on $\ell_p$ attacks (that can be fully specified mathematically) on a simplified binary classification task for the convenience of theoretical analysis.  
    However, {\ours} can also improve the empirical robustness of models trained on common image classification benchmarks against both $\ell_p$ and non-$\ell_p$ attacks. We will discuss the empirical examination in Section~\ref{sec:exp}.
    
    \subsection{Problem Setting}
    \textbf{Data distribution.}
    We consider a distribution $\mathcal{D}$ of inputs sampled from the union of two multi-variate Gaussian distributions such that the input-label pairs $(x,y)$ can be described as:
    \begin{align}
        \centering
        \begin{split}
            y \stackrel{u.a.r}{\sim}&\{-1,+1\}, \\
            x_0 {\sim} \mathcal{N}(y\alpha, \sigma^2),
            \quad
            &x_1, \dots, x_d \stackrel{i.i.d}{\sim} \mathcal{N}(y\eta, \sigma^2),
        \end{split}
    \end{align}
    where $x = \left[x_0,x_1,\dots,x_d\right] \in \mathbb{R}^{d+1}$ and $\eta = \frac{\alpha}{\sqrt{d}}$.
    This setting demonstrates the distinction between a feature $x_0$ that is strongly correlated with the label, and $d$ weakly correlated features that are (independently) normally distributed with the mean $y\eta$ and the variance $\sigma^2$. In our work, we assume that $\frac{\alpha}{\sigma} > 10$ ($x_0$ is strongly correlated) and $d>100$ (remaining $d$ features are weakly correlated, but together represent a strongly correlated feature). This setting was adapted from~\cite{ilyas2019adversarial}, and more discussion can be found in Appendix~\ref{app:sep-proof-setting}.
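
    As a brief check of the claim that the weak features are jointly strong (a sketch added for illustration; the aggregate feature $\bar{x}$ is notation introduced here and is not used elsewhere), consider the normalized sum of the $d$ weak features. Since the $x_i$ are independent with mean $y\eta$ and variance $\sigma^2$,
    \begin{equation*}
        \bar{x} = \frac{1}{\sqrt{d}} \sum_{i=1}^{d} x_i \sim \mathcal{N}\left(\sqrt{d}\, y\eta, \sigma^2\right) = \mathcal{N}(y\alpha, \sigma^2),
    \end{equation*}
    which matches the distribution of $x_0$. Each weak feature alone has a signal-to-noise ratio of only $\eta/\sigma = \alpha/(\sigma\sqrt{d})$, but jointly they are as informative as the single strong feature.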
    
    
    
    \textbf{Perturbation types.}
    We focus our theoretical discussion on adversaries constrained within a fixed $\ell_p$ region of radius $\epsilon_p$ around the original input, for $\ell_p \in \mathcal{S} = \{\ell_1,\ell_\infty\}$. Such adversaries are frequently studied in existing work for finding the optimal first-order perturbation for different attack types. Let $\ell (\cdot, \cdot)$ be the cross-entropy loss, and let $\Delta_\mathcal{S} = \bigcup_{\ell_p \in \mathcal{S}} \Delta_{\ell_p,\epsilon_p}$, where $\Delta_{\ell_p,\epsilon_p}$ denotes the $\ell_p$ threat model of radius $\epsilon_p$. Then, for a model $f_\theta$, the optimal perturbation $\delta^*$ is given by:
    \begin{equation}
    \label{eqn:adv-gen}
            \delta^{*} 
            = \argmax_{\delta \in \Delta_{\mathcal{S}}} \ell (f_\theta(x+\delta), y).
    \end{equation}
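
    For intuition, consider the special case of a linear model (an assumption made only for this sketch): $f_\theta(x) = w^\top x$ with a loss that decreases monotonically in the margin $y\, w^\top x$. Maximizing the loss over a single $\ell_p$ ball then amounts to minimizing $y\, w^\top \delta$, which has the closed-form solutions
    \begin{align*}
        \delta^{*}_{\ell_\infty} &= -y\,\epsilon_\infty \operatorname{sign}(w), \\
        \delta^{*}_{\ell_1} &= -y\,\epsilon_1 \operatorname{sign}(w_j)\, e_j, \quad j \in \argmax_{i} |w_i|,
    \end{align*}
    where $e_j$ is the $j$-th standard basis vector. The $\ell_\infty$ adversary spreads its budget over every coordinate and reduces the margin by $\epsilon_\infty \|w\|_1$, whereas the $\ell_1$ adversary concentrates its budget on the single most influential coordinate and reduces the margin by $\epsilon_1 \|w\|_\infty$. The optimum over the union $\Delta_\mathcal{S}$ is whichever of the two yields the larger loss.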
    
    \subsection{Separability of $\ell_p$ Perturbations}
    \label{subsec:sep-perturb}
    Consider a classifier $M$ trained with the objective of correctly classifying inputs $x\in\mathcal{D}$.
    The goal of the adversary is to fool $M$ by finding the optimal perturbation $\delta_\mathcal{A}\; \forall \mathcal{A} \in \mathcal{S}$. The theorem below shows that the distributions of adversarial inputs within different $\ell_p$ regions can be separated with high accuracy.
    \begin{theorem}[Separability of perturbation types]
    \label{thm:separability}
    Given a binary Gaussian classifier $M$ trained
    on $\mathcal{D}$,
    consider $\mathcal{D}_p^y$ to be the distribution of optimal adversarial inputs (for a class $y$) against $M$, within $\ell_p$ regions of radius $\epsilon_p$, where $\epsilon_1 = \alpha$, $\epsilon_\infty=\alpha/\sqrt{d}$.
    Distributions $\mathcal{D}_p^y$ ($p \in \{1,\infty\}$)  can be accurately separated by a binary Gaussian classifier $C_{adv}$ with a misclassification probability $P_e \leq 10^{-24}$.
    \end{theorem}
    
    The proof sketch is as follows. We first calculate the optimal weights of a binary Gaussian classifier $M$ trained on $\mathcal{D}$. Accordingly, for any input $x\in\mathcal{D}$, we find the optimal adversarial perturbation $\delta_\mathcal{A}\; \forall \mathcal{A}\in\{\ell_1,\ell_\infty\} $ against $M$. We discuss how these perturbed inputs $x+\delta_\mathcal{A}$ also follow a normal distribution, with shifted means. Finally, for data points of a given label, we show that $C_{adv}$ is able to predict the correct perturbation type with a very low error. We present the formal~proof in~Appendix~\ref{app:sep-proof}.
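
    To make the shifted means concrete, the following is a sketch under the simplifying assumption (made here for illustration; the appendix gives the exact construction) that $M$ is a linear classifier whose weights are proportional to the class mean, $w \propto [\alpha, \eta, \dots, \eta]$. Since $\alpha > \eta$, the $\ell_1$ adversary spends its entire budget $\epsilon_1 = \alpha$ on $x_0$, while the $\ell_\infty$ adversary shifts every coordinate by $\epsilon_\infty = \alpha/\sqrt{d} = \eta$ toward the opposite class. Writing $\mu_p^y$ for the mean of $\mathcal{D}_p^y$,
    \begin{align*}
        \mu_1^y &= y\,[0, \eta, \dots, \eta], \\
        \mu_\infty^y &= y\,[\alpha - \eta, 0, \dots, 0],
    \end{align*}
    and the covariance $\sigma^2 I$ is unchanged because, for a given label, each perturbation is a deterministic shift. Under this sketch, $\ell_1$ examples lose the strong feature but retain the weak ones, and $\ell_\infty$ examples do the opposite, so the two distributions differ in every coordinate.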
    
    
\section{\ours{}: Perturbation Type Categorization for Robustness}
\label{sec:method}
We illustrate the {\ours} pipeline in Figure~\ref{fig:pipeline}. {\ours} performs the classification task as a two-stage process. Given an input, {\ours} first utilizes a~\emph{perturbation classifier} $C_{adv}$ to predict its perturbation type. Then, based on the predicted type, {\ours} uses the corresponding second-level predictor $\mp{}$ to provide the final prediction, where $\mp{}$ is specially trained to be robust against the attack $\mathcal{A}\in\mathcal{S}$. Formally, let $f_\theta$ be the {\ours} model, then:
\begin{equation}
\label{eqn:ctp}
    \begin{split}
    f_\theta(x) &= \mp{}(x), \quad \text{s.t.} \quad \mathcal{A} = \argmax C_{adv}(x).
    \end{split}
\end{equation}
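
As a concrete instance (a sketch for the two-type case $\mathcal{S} = \{\ell_1, \ell_\infty\}$, with ties broken toward $\ell_1$ as an arbitrary choice made here), let $C_{adv}(x)_{\mathcal{A}}$ denote the score that $C_{adv}$ assigns to type $\mathcal{A}$. Equation~\ref{eqn:ctp} then reads
\begin{equation*}
    f_\theta(x) =
    \begin{cases}
        M_{\ell_1}(x), & \text{if } C_{adv}(x)_{\ell_1} \geq C_{adv}(x)_{\ell_\infty}, \\
        M_{\ell_\infty}(x), & \text{otherwise,}
    \end{cases}
\end{equation*}
so exactly one second-level predictor determines the output for any given input.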
    
    
\subsection{Adversarial Trade-off}
\label{subsec:trade-off}
In Section~\ref{subsec:sep-perturb}, we showed that the optimal perturbations of different attack types belong to different data distributions, and can be separated by a simple classifier. However, in the white-box setting, the adversary has knowledge of both the perturbation classifier ($C_{adv}$) and specialized robust models ($\mp{}$). This allows it to adapt the attack to fool the entire pipeline instead of individual models alone.
To validate the robustness of \ours{}, we provide a theoretical justification in Theorem \ref{thm:trade-off}, showing that \ours{} naturally offers a trade-off between fooling $C_{adv}$ and the individual models $\mp{}$. This makes it difficult for adversaries to stage successful attacks against \ours{}.

Note that there are some overlapping regions among different perturbation constraints. For example, every adversary could set $\delta_p = 0$ as a valid perturbation, in which case $C_{adv}$ cannot correctly classify all attacks. However, such perturbations are not useful to the adversary, because any $\mp$ can correctly classify unperturbed inputs with a high probability. In the following theorem, we examine the robustness of {\ours} in the presence of such strong dynamic adversaries.

\begin{theorem}[Adversarial trade-off]
\label{thm:trade-off}

Given a data distribution $\mathcal{D}$, adversarially trained models $M_{\ell_p,\epsilon_p}$, and an attack classifier $C_{adv}$ that distinguishes perturbations of different $\ell_p$ attack types for $p \in \{1,\infty\}$; the probability of a successful attack by the strongest adversary over the {\ours} pipeline is $ P_e < 0.01$ for ${\epsilon_1} = \alpha + 2\sigma$ and ${\epsilon_\infty} = \frac{\alpha + 2\sigma}{\sqrt{d}}$.

\end{theorem}

Here, the strongest (\textit{worst-case}) adversary refers to an adaptive adversary that has full knowledge of the defense strategy.
In Appendix~\ref{app:subsec:perturbation_size}, we discuss how ${\epsilon_1},{\epsilon_\infty}$ are set so that the $\ell_1$ and $\ell_\infty$ adversaries can fool $M_{\ell_\infty,\epsilon_\infty}$ and $M_{\ell_1,\epsilon_1}$ models respectively with a high success rate. 
To prove Theorem~\ref{thm:trade-off}, we first show that when trained on $\mathcal{D}$, an adversarially robust model $\mp{}$ can achieve robust accuracy $>99\%$ against the attack type it was trained for, and $<2\%$ against an alternate attack.
By ``alternate'' we mean that for an $\ell_q$ attack, the prediction is made by the $M_{\ell_p,\epsilon_p}$ model.
Then, we analyze the modified distributions of the inputs perturbed by different $\ell_p$ attacks. Based on this, we construct a simple decision rule for the perturbation classifier $C_{adv}$. Finally, we compute the perturbation induced by the worst-case adversary. We show that there exists a trade-off between fooling $C_{adv}$ (to allow the alternate $M_{\ell_p,\epsilon_p}$ model to make the final prediction for an $\ell_q$ attack $\forall p,q\in \{1,\infty\}; p\neq q$), and fooling the alternate $M_{\ell_p,\epsilon_p}$ model itself.
We provide an illustration of the trade-off in Figure~\ref{fig:tradeoff}, and a formal proof and \textit{experimental validation} on the toy task in Appendix~\ref{app:trade-off-proof}. 

    

\section{Training and Inference}
\label{sec:implementation}
We now extend \ours{} to deep neural networks trained on common image classification benchmarks. Following prior work on defending against multiple perturbation types,
 we evaluate on MNIST \citep{lecun2010mnist} and CIFAR-10 \citep{krizhevsky2012cifar} datasets. Here, we present the training details, the formulation of an ensemble of perturbation classifiers, and adaptive white-box attacks against {\ours}.

\subsection{Dataset Creation}
\label{subsec:dataset-creation}
\begin{figure}[t]
        \centering
        \begin{subfigure}[b]{0.65\columnwidth}
        \centering
        \includegraphics[width=0.99\columnwidth]{figures/plot.pdf}
          \caption{}\label{fig:PCA}
        \end{subfigure}
        \\
        \begin{subfigure}[b]{0.6\columnwidth}
        \centering
              \includegraphics[width=0.99\columnwidth]{figures/Noise.pdf} 
          \caption{}\label{fig:Noise}
        \end{subfigure}
        
        \caption{(a) PCA for different adversarial perturbations on MNIST. (b) Illustration of the effect of random noise on generating adversarial examples. The notion of small, large perturbations is only used to illustrate the scenario in Figure~\ref{fig:Noise}, and neither perturbation region~subsumes~the~other.}\label{fig:test}
\end{figure}

To train our perturbation classifier $C_{adv}$, we create a dataset that includes adversarial examples of different perturbation types. We perform adversarial attacks against each of the individual $\mp{}$ models used in {\ours} to curate the training and test sets. In the case of $\ell_p$ examples, we use the PGD attack~\citep{aleks2017deep}, and for spatial~\citep{xiao2018spatially} and recolor~\citep{laidlaw2019functional} attacks, we use their original attack formulation.
The time for creating the dataset against each $\mp$ is the same as running a single epoch of adversarial training. Since most recent works typically train their models for $\sim$200 epochs, the dataset creation time is insignificant when compared with the cost of training an $\mp$ model.

\textbf{Combining perturbation types.} 
When training \ours{} to be robust against a set $\mathcal{S}$ of multiple ($k$) attacks, we combine certain perturbation types under the same label to improve the overall robustness. This is beneficial when: (a) a specialized model $\mp$ also shows a high degree of robustness to a different attack $\mathcal{B}\in\mathcal{S}$, s.t. $\mathcal{A}\neq\mathcal{B}$; (b) two different attack types $\mathcal{A},\mathcal{B}\in\mathcal{S}$ have similar characteristics.
For instance, in case of $\ell_p$ attacks, we perform binary classification between $\mathcal{A}=\{\{\ell_1,\ell_2\},\ell_\infty\}$. We hypothesize that compared to $\ell_\infty$ adversarial examples, $\ell_1$ and $\ell_2$ adversarial examples 
show similar characteristics.
To provide an intuitive illustration, we randomly sample 10K adversarial examples generated with PGD attacks on MNIST, and present their Principal Component Analysis (PCA) in Figure~\ref{fig:PCA}. We observe that the first two principal components for $\ell_1$ and $\ell_2$ adversarial examples are largely overlapping, while those for $\ell_\infty$ are clearly from a different distribution.\footnote{The visualization only serves as motivation. It does not suggest that $\ell_1$, $\ell_2$ examples are not separable.} 
For the MNIST dataset, we use the $M_{\ell_2}, M_{\ell_\infty}$ models in \ours{}, and we use $M_{\ell_1}, M_{\ell_\infty}$ models for CIFAR-10. The choice is made based on the robustness of $\{M_{\ell_2},M_{\ell_1}\}$ models against $\{\ell_1,\ell_2\}$ attacks respectively, as will be depicted in Table~\ref{tab:res}. 
Similarly, when defending against the union of $\ell_p$ and non-$\ell_p$ perturbation types on CIFAR-10, we classify $\mathcal{A} = \{\{\ell_\infty, \ell_2, \re\},\st\}$ attacks based on the robustness of each $\mp{}$  against every attack  $\mathcal{B}\in\mathcal{S}$. We report the robustness of \ours{} with varying number of second-level predictors in Appendix~\ref{app:subsec:number-of-mp}.

\subsection{Training}
Past works~\citep{maini2019adversarial,tramer2019adversarial} on robustness to multiple attack types require intensive hyperparameter tuning to \emph{balance} different attack types when one attack is stronger than others. We find that a similar phenomenon plagues the adversarial training (AT) of $C_{adv}$. Therefore, we train $C_{adv}$ over a static dataset, 
which is fast and stable. Specifically, using a single GTX 1080Ti GPU, $C_{adv}$ can be trained within 5 and 30 minutes on MNIST and CIFAR-10 respectively (given that we already have access to perturbation-specific robust models). On the other hand, training state-of-the-art models robust to a single perturbation type requires up to 2 days on the same GPU, and existing defenses against multiple (\textit{k}) perturbation types take \textit{k} times as long as the training time for robustness against a single perturbation type. In contrast, even when the individual $\mp$ models are unavailable, \ours{} allows the \textit{k} models to be trained in parallel to improve training speed.

A key advantage of {\ours}'s design is that it can build upon existing defenses against individual perturbation types. Specifically, we leverage the adversarially trained models developed in prior work~\citep{zhang2019theoretically,carmon2019unlabeled} as $\mp{}$ models in our pipeline. The architecture of $C_{adv}$ is also similar to a single $\mp{}$ model. See Appendix~\ref{app:architecture} for more details.

\subsection{Inference Procedure}
\label{sec:inference}
\textbf{Ensemble of diverse perturbation classifiers.}
While $C_{adv}$ learns the ability to distinguish between different attack types, it is not immune to the presence of adaptive adversaries that try to fool $C_{adv}$ and the $\mp$ models together. 
To improve model robustness against such adversaries, we attempt to increase the trade-off in \ours{} that was described in Section~\ref{subsec:trade-off}. We use an ensemble (average of prediction logits) of two perturbation classifiers that classify adversarial examples in different domains: the image domain and the Fourier domain.\footnote{Adversaries can still back-propagate through the Fourier transformation steps.} Owing to this diversity, the classification landscape of each $C_{adv}$ is different. 
Intuitively, the trade-off between fooling the two stages of \ours{} confines the adversary in a very small region for generating successful adversarial attacks when using an ensemble of perturbation classifiers. In Appendix~\ref{app:fourier}, we show how the adversarial examples can be visually separated in the Fourier domain~\citep{yin2019fourier} and discuss further implementation details of the ensemble.
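
One way to write the ensemble explicitly (a sketch; the symbols $C_{adv}^{\mathrm{img}}$, $C_{adv}^{\mathrm{fft}}$, and $\mathcal{F}$ are introduced here for illustration, and the exact preprocessing is described in the appendix) is as the average of the logits of the two classifiers:
\begin{equation*}
    C_{adv}(x) = \tfrac{1}{2}\left( C_{adv}^{\mathrm{img}}(x) + C_{adv}^{\mathrm{fft}}\left(\mathcal{F}(x)\right) \right),
\end{equation*}
where $\mathcal{F}$ denotes the Fourier transform of the input image. Because $\mathcal{F}$ is differentiable, an adaptive adversary can compute gradients of both terms with respect to $x$.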

\textbf{Constraining the adversary using random noise.}
While past work~\citep{hu2019new} has suggested that adding random noise does not help defend against adversarial inputs, the trade-off described in Theorem~\ref{thm:trade-off} leads to a property \emph{unique} to \ours{}: adversarial attacks against the pipeline are likely to fail once random noise is added to them.
Intuitively, the trade-off between fooling the two stages of \ours{} confines the adversary in a very small region for crafting successful attacks.

Consider the illustrative example in Figure~\ref{fig:Noise}. The input $(x,y=0)$ is subjected to an $\ell_\infty$ attack. Assume that the $M_{\ell_\infty,\epsilon_\infty}$ model is a perfect classifier for adversarial examples within a fixed $\epsilon_\infty$ region. The dotted line shows the decision boundary for $C_{adv}$, which correctly classifies inputs subjected to $\ell_\infty$ perturbations $\delta''$ as $\ell_\infty$ attacks (green), but misclassifies samples with smaller perturbations.
When the adversary adds a large perturbation $\delta''$, the prediction of $M_{\ell_1}$ for the resulting input $x''$ becomes wrong, but the perturbation classifier categorizes $x''$ as an $\ell_\infty$ attack, so the final prediction of \ours{} is still correct because it is produced by the $M_{\ell_\infty,\epsilon_\infty}$ model instead. On the other hand, when the adversary adds a small perturbation $\delta'$ to fool the perturbation classifier, adding a small amount of random noise can recover the correct prediction with a high probability. Note that every point on the boundary of the noise region (yellow circle) is correctly classified by the pipeline. In this way, adding random noise exploits the adversarial trade-off, allowing {\ours} to achieve high accuracy against adversarial examples without adversarially training $C_{adv}$. In our implementation, we sample random noise $z\sim\mathcal{N}(0,I)$, and add $\hat{z} = \epsilon_2 \cdot z/\|z\|_2$ to the model input.
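
By construction (a short check added for clarity), the injected noise has a fixed $\ell_2$ norm,
\begin{equation*}
    \|\hat{z}\|_2 = \epsilon_2 \cdot \frac{\|z\|_2}{\|z\|_2} = \epsilon_2,
\end{equation*}
so the noised input $x + \delta + \hat{z}$ is drawn uniformly from the sphere of radius $\epsilon_2$ centered at the adversarial input $x+\delta$, because the direction $z/\|z\|_2$ of an isotropic Gaussian vector is uniformly distributed on the unit sphere. This matches the picture in Figure~\ref{fig:Noise}, where the relevant points lie on the boundary of the noise region.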

\subsection{Adaptive Attacks against {\ours}}
\label{subsec:adaptive}

\textbf{Gradient propagation.}
Since the final prediction in Equation~\ref{eqn:ctp} only depends on a single $\mp{}$ model, 
the pipeline does not allow gradient flow across the
two levels.
This can make it difficult for gradient-based adversaries to attack \ours{}. Therefore,
we utilize a combination of predictions from each individual $\mp{}$ model
by modifying $f_\theta(x)$ in Equation~\ref{eqn:ctp} as follows:
\begin{equation}
\label{eqn:adaptive_softmax}
    \begin{split}
    c = \operatorname{softmax}(C_{adv}(x)); \\
    f_\theta(x) = 
    \sum_{\mathcal{A} \in \mathcal{S}} c_\mathcal{A} \cdot \mp{}(x),
    \end{split}
\end{equation}
where $c_\mathcal{A}$ denotes the probability of the input $x$ being classified as the perturbation type $\mathcal{A}$ by $C_{adv}$. Equation~\ref{eqn:adaptive_softmax} is only used for the purpose of generating adversarial examples and performing gradient-based attack optimization. For consistency, we still use Equation~\ref{eqn:ctp} to compute the model prediction at inference (final forward-propagation). We do not see any significant performance advantages of either choice during inference, and briefly report a comparison in Appendix~\ref{app:inference-exp}.
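
To see why Equation~\ref{eqn:adaptive_softmax} restores gradient flow, the following sketch applies the product rule to a single output coordinate $k$ of $f_\theta(x)$ (the index $k$ is our notation, introduced only for illustration):
\begin{equation*}
\frac{\partial f_\theta(x)_k}{\partial x} = \sum_{\mathcal{A} \in \mathcal{S}} \left( c_\mathcal{A} \cdot \frac{\partial \mp{}(x)_k}{\partial x} + \mp{}(x)_k \cdot \frac{\partial c_\mathcal{A}}{\partial x} \right).
\end{equation*}
The first term carries gradients from every second-level predictor, weighted by the soft assignment $c_\mathcal{A}$, and the second term carries gradients from $C_{adv}$ through the softmax. Under a hard selection of a single $\mp{}$, as the text describes for Equation~\ref{eqn:ctp}, the selection is piecewise constant in $x$, so a term of the second kind is unavailable to a gradient-based attacker.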

\textbf{Separately attacking $C_{adv}$ and $\mp$.}
We also experiment with other strategies of aggregating the predictions of different components, e.g., tuning the loss to balance direct attacks on $C_{adv}$ and each $\mp{}$ model. We find that this attack formulation performs worse than attacking the entire pipeline with Equation~\ref{eqn:adaptive_softmax}. We provide a discussion on this attack in Appendix~\ref{app:adaptive}.


\section{Experiments}
\label{sec:exp}

In this section, we present our results on the MNIST and CIFAR-10 datasets, both for the perturbation classifier $C_{adv}$ alone and for the entire {\ours} pipeline.


\subsection{Perturbation Categorization by \texorpdfstring{$C_{adv}$}{C\_adv}}





\textbf{Categorizing $\ell_p$ perturbations.}
\label{sec:res-perturbation-classification}
First, we justify our choice of $\epsilon_p$ radii by empirically quantifying the overlapping regions of different types of adversarial attacks. We observe that the empirical overlap is exactly 0\% in all cases on both MNIST and CIFAR-10, and we present the full analysis in Appendix~\ref{app:perturb_overlap_stats}.
We then evaluate the categorization performance of $C_{adv}$ on a dataset of adversarial examples which are generated against the six models we use as the baseline defenses in our experiments. Note that $C_{adv}$ is only trained on adversarial examples against the two $\mp{}$ models that are part of \ours{}.



Next, we evaluate the test set generalization across the various datasets created. We observe that $C_{adv}$ transfers well across the board. First, $C_{adv}$ generalizes to adversarial examples against new models, i.e., it preserves a high accuracy, even if the adversarial examples are generated against models that are unseen during training. 
Further, $C_{adv}$ also generalizes to new attack algorithms. As discussed in Section~\ref{subsec:dataset-creation}, we only include PGD adversarial examples in our training set for $C_{adv}$. However, on adversarial examples generated by the AutoAttack library, the classification accuracy of $C_{adv}$ still holds up. 
In particular, the accuracy is $>95\%$ across all the individual test sets created. 
These results suggest two important findings that validate our results in Theorem~\ref{thm:separability} --- independent of \textbf{(a)} the model to be attacked; and \textbf{(b)} the algorithm for generating the optimal adversarial perturbation, the optimal adversarial images for a given $\ell_p$ region follow similar distributions. We present the full results in Appendix~\ref{app:subsec:c-adv}. 


\begin{table}[t]
\centering
\caption{Generalization results when $C_{adv}$ is trained on different \textcolor{red}{Noise}, \textcolor{darkgreen}{Blur}, \textcolor{darkblue}{Weather} and \textcolor{darkorange}{Digital} corruptions (Severity=5). Each row adds the listed corruptions to the training set of the rows above it. Testing is performed on \textcolor{red}{Speckle Noise} + \textcolor{darkgreen}{Gaussian Blur}~+~\textcolor{darkblue}{Spatter}~+~\textcolor{darkorange}{Saturate}.}
\label{tab:corruptions-generalization}
 \scalebox{0.9}{
\begin{tabular}{l|r}
\hline
Trained On & Accuracy\\
\hline
\textcolor{red}{Impulse} + \textcolor{darkgreen}{Defocus Blur} + \textcolor{darkblue}{Snow} + \textcolor{darkorange}{Brightness}            &  70.4\% \\
+ \textcolor{red}{Gaussian} + \textcolor{darkgreen}{Glass Blur} + \textcolor{darkblue}{Fog} + \textcolor{darkorange}{Contrast} & 80.1\% \\
+ \textcolor{red}{Shot} + \textcolor{darkgreen}{Motion Blur} + \textcolor{darkblue}{Frost} + \textcolor{darkorange}{Elastic Trans}     & 85.6\% \\
+ \textcolor{darkgreen}{Zoom Blur}  + \textcolor{darkorange}{JPEG Compression} + \textcolor{darkorange}{Pixelate}                & 93.5\% \\
+ \textcolor{red}{Speckle} + \textcolor{darkgreen}{Gaussian Blur} + \textcolor{darkblue}{Spatter} + \textcolor{darkorange}{Saturate}       & 99.8\% \\
\hline
\end{tabular}}
\end{table}

\textbf{Categorizing common corruptions.}
\label{subsubsec:common-corruptions}
CIFAR-10-C is a benchmark consisting of 19 different types of common corruptions~\citep{hendrycks2019benchmarking}. For each image in the original CIFAR-10 test set, CIFAR-10-C includes images with different corruptions.
To train the corruption classifier, we split CIFAR-10-C, so that each corruption type has 9K training samples, and 1K for testing. 
For corruptions of the highest severity, we observe that our corruption classifier achieves greater than 99\% test accuracy on the test split. Details about the architecture are deferred to Appendix~\ref{app:architecture}.
This demonstrates that our perturbation classifier is applicable to both $\ell_p$ adversarial perturbations and semantic common corruptions. We discuss detailed results of corruption classification at various severity levels in Appendix~\ref{app:common-corruptions}. 

\begin{table*}[t]
  \caption{Worst-case accuracies against different $\ell_p$ attacks: (a) MNIST; (b) CIFAR-10. \emph{Ours} represents \ours{} against the adaptive attack strategy (Eq~\ref{eqn:adaptive_softmax}), and~\emph{Ours*} is the standard setting.}
  \label{tab:res}
  \centering
  \scalebox{1.0}{
  \begin{subtable}{\textwidth}
  \centering
  \begin{tabular}{l|rrrrrrrr}
    \hline
        \textbf{MNIST}                  & $M_{\ell_\infty}$ & $M_{\ell_2}$ & $M_{\ell_1}$ & MAX & AVG & MSD &Ours &Ours*\\
    \hline 
    Clean accuracy                          & 99.2\% &	98.7\% &	98.8\% &	98.6\% &	99.1\% &	98.3\%  &98.9\% &98.9\%\\
    \hline
    $\ell_\infty$ attacks $(\epsilon=0.3)$ & 90.2\% &	2.6\% &	0.0\% &	39.0\% &	57.8\% &	63.5\%      
    &78.1\% &79.0\%\\
    
    
    $\ell_2$ attacks $(\epsilon=2.0)$      & 9.5\% &	72.3\% &	47.8\% &	58.5\% &	58.6\% &	65.7\% 
    &66.6\% &72.3\%\\
    $\ell_1$ attacks $(\epsilon=10)$       & 18.8\% &	70.6\% &	77.5\% &	41.8\% &	46.1\% &	64.3\% 
    &68.1\% &72.5\%\\
    \hline
    All attacks                            & 7.3\% &	2.6\% &	0.0\% &	29.1\% &	37.1\% &	57.2\% &\textbf{63.6}\% &\textbf{67.2}\%\\
    \hline
  \end{tabular}
  \caption{}
  \label{tab:res-mnist}
  \end{subtable}}
  
 \scalebox{1.0}{
  \begin{subtable}{\textwidth}
  \centering
  \begin{tabular}{l|rrrrrrrr}
    \hline
       \textbf{CIFAR-10}                             & $M_{\ell_\infty}$ & $M_{\ell_2}$ & $M_{\ell_1}$ & MAX & AVG & MSD & Ours &Ours*\\
    \hline
    Clean accuracy                           & 89.5\% &	93.9\% &	89.0\% &	81.0\% &	84.6\% &	81.7\% &	89.0\% &	89.0\% \\
    \hline
    $\ell_\infty$ attacks $(\epsilon=0.03)$& 59.3\% &	34.8\% &	35.0\% &	34.9\% &	39.7\% &	43.7\% &	56.1\% &	58.4\% \\
    $\ell_2$ attacks $(\epsilon=0.5)$      & 64.6\% &	77.2\% &	71.5\% &	61.8\% &	65.5\% &	64.5\% &	69.3\% &	69.4\% \\
    $\ell_1$ attacks $(\epsilon=10)$       & 27.6\% &	45.3\% &	60.9\% &	43.7\% &	60.0\% &	56.1\% &	57.9\% &	59.5\% \\
    \hline
    All attacks                            &27.6\% &	32.9\% &	35.0\% &	31.5\% &	39.3\% &	43.5\% &	\textbf{53.5\%} &	\textbf{54.9\% }\\
    \hline
  \end{tabular}
  \caption{}
  \label{tab:res-cifar10}
  \end{subtable}}
\end{table*}

\textbf{Generalization to unseen corruptions.}
We further evaluate the generalization of the perturbation classifier to unseen corruption types. Specifically, different from the above setting of classifying corruption types, now our classifier categorizes all corruption types into 4 categories --- noise, blur, digital, and weather (as defined in the CIFAR-10-C benchmark).
We evaluate the model performance on 4 held-out corruption types, 1 for each category, and select these corruption types following the model validation setting in~\citet{hendrycks2019benchmarking}.
From the remaining 15 corruption types, we vary the number of corruptions included for training, and present the results in Table~\ref{tab:corruptions-generalization}. 
We observe that even when the perturbation classifier is not trained on the corruption types used for testing, it still obtains a high generalization accuracy ($>90\%$). These results demonstrate that perturbation classification is effective even for unseen perturbations.




\begin{table*}[t]
\caption{
Worst-case accuracies against $\ell_\infty \; (\epsilon=0.03)$, $\ell_2 \; (\epsilon=0.5)$, spatial and recolor attacks. \emph{Ours} represents \ours{} against the adaptive attack strategy (Eq~\ref{eqn:adaptive_softmax}), and~\emph{Ours*} is the standard setting. PAT~\citep{laidlaw2021perceptual} is trained using perceptual adversarial training.
}
  \centering
  \scalebox{1.0}{
  \begin{tabular}{l|rrrrrrrrr}
    \hline
       \textbf{CIFAR-10}   & $M_{\ell_\infty}$  & $M_{\ell_2}$  & $M_\st{}$ & $M_\re{}$ & MAX & AVG & PAT & Ours & Ours*\\
    \hline
    Clean acc.          & 89.5\%                   & 93.9\%       & 86.2\% & 93.4\% & 84.0\% & 86.8\% & 71.6\% & 89.5\% & 89.5\% \\
    \hline
    $\ell_\infty$ attacks & 59.3\%           & 34.8\%    & 0.1\%  & 8.5\%  & 25.8\% & 42.1\% & 29.8\% & 58.2\% & 59.1\% \\
    $\ell_2$ attacks      & 64.6\%           & 77.2\%    & 10.0\% & 34.8\% & 44.2\% & 64.8\% & 54.1\% & 57.0\% & 57.2\% \\
    $\st{}$       & 5.7\%            & 0.2\%     & 68.9\% & 0.0\%  & 46.2\% & 27.8\% & 58.4\% & 50.4\% & 55.7\% \\
    $\re{}$       & 85.5\%           & 84.0\%    & 52.1\% & 86.8\% & 77.4\% & 80.5\% & 70.9\% & 85.2\% & 85.3\%\\
    \hline
    All attacks                            & 5.4\%            & 0.2\%     & 0.1\%  & 0.0\%  & 24.0\% & 21.5\% & 27.8\% & \textbf{40.9\%} & \textbf{41.9\%} \\
    \hline
  \end{tabular}}
  \label{tab:PAT}
  \end{table*}


\subsection{Robustness to $\ell_p$ attacks}
\label{sec:res-pipeline}
\label{subsec:exp-setup}
\textbf{Baselines.} We compare \ours{} with the state-of-the-art defenses against the union of $\ell_1, \ell_2, \ell_\infty$ adversaries. For~\citet{tramer2019adversarial}, we compare two variants of adversarial training: (1) the~\textbf{MAX} approach, where for each image, among different perturbation types, the adversarial sample that leads to the maximum increase of the model loss is augmented into the training set; (2) the~\textbf{AVG} approach, where adversarial examples for all perturbation types are included for training. We also compare with~\textbf{MSD}~\citep{maini2019adversarial}, which modifies the standard PGD attack to incorporate the union of multiple perturbation types within the steepest descent. In addition, we evaluate $\mathbf{M_{\ell_1}, M_{\ell_2}, M_{\ell_\infty}}$ models trained with $\ell_1, \ell_2, \ell_\infty$ perturbations separately, as described~in~Appendix~\ref{app:architecture}. 

\textbf{Attack evaluation.} We evaluate against the strongest attacks in the adversarial examples literature, and with adaptive attacks specifically designed for \ours{} (Section~\ref{subsec:adaptive}). We perform standard PGD attacks along with attacks from the AutoAttack library \citep{croce2020reliable}, which achieves state-of-the-art adversarial error rates against multiple recently published models. The radius of the $\{\ell_1, \ell_2, \ell_\infty\}$ perturbation regions is $\{10,2,0.3\}$ for the MNIST dataset and $\{10,0.5,0.03\}$ for the CIFAR-10 dataset. We present the full details of attack~algorithms~in~Appendix~\ref{app:attacks_used}. 


Following prior work, we evaluate models on adversarial examples generated from the first 1000 images of the test set for MNIST and CIFAR-10. Our main evaluation metric is the accuracy on \emph{all attacks}: a given input is a failure case if any of the attack algorithms in our suite successfully fools the model.
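
Stated as a formula (a sketch in our own notation, not taken from the attack libraries: $\mathcal{T}$ denotes the attack suite, $a(x_i)$ the adversarial example that attack $a$ produces for the $i$-th test image, $N$ the number of evaluated images, and $f_\theta(\cdot)$ is treated as the predicted label), the \emph{all attacks} accuracy is
\begin{equation*}
\mathrm{Acc}_{\text{all}} = \frac{1}{N} \sum_{i=1}^{N} \min_{a \in \mathcal{T}} \mathbf{1}\left[ f_\theta(a(x_i)) = y_i \right].
\end{equation*}
Because the minimum over attacks is taken per image, $\mathrm{Acc}_{\text{all}}$ can be no larger than the accuracy against any single attack, and it is typically strictly smaller when different attacks succeed on different images.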

\textbf{Results.} 
In Table~\ref{tab:res}, we summarize the worst-case performance against all attacks of a given perturbation type for the MNIST and CIFAR-10 datasets. In particular, ``Ours'' denotes the robustness of \ours{} against the adaptive attacks described in Section~\ref{subsec:adaptive}, and ``Ours*'' denotes the robustness of \ours{} against standard attacks based on Equation~\ref{eqn:ctp}. The adaptive strategy reduces the accuracy of {\ours} on the \emph{all attacks} metric by $3.6\%$ on MNIST and $1.4\%$ on CIFAR-10, showing that incorporating the gradient and prediction information of all second-level predictors results in a stronger attack.

\ours{} outperforms all baselines by $6.4\%$ on MNIST, and $10\%$ on CIFAR-10 in terms of the~\emph{all attacks} metric, even when evaluated against a strong adaptive adversary. Compared to the previous state-of-the-art defense against multiple perturbation types (MSD), the accuracy gain on $\ell_\infty$ attacks is especially notable, i.e., around $15\%$. In particular, if we compare the performance on each individual attack algorithm, as shown in Appendices~\ref{app:mnist} and~\ref{app:cifar10} for MNIST and CIFAR-10 respectively, the average accuracy gain is $\sim15\%$ for both datasets. These results demonstrate that {\ours} considerably mitigates the trade-off in the accuracy for individual attacks.
Further,
{\ours} retains a $7\%$ higher CIFAR-10 accuracy on \textit{clean images}, as opposed to past defenses that sacrifice benign 
accuracy for robustness to multiple perturbation types. 
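
The headline gaps can be recomputed from the \emph{All attacks} and \emph{Clean accuracy} rows of Table~\ref{tab:res}. The sketch below assumes that MSD is the reference baseline in each case, since it is the strongest baseline on the \emph{all attacks} metric for both datasets:
\begin{align*}
\text{MNIST, all attacks:} \quad & 63.6\% - 57.2\% = 6.4\%, \\
\text{CIFAR-10, all attacks:} \quad & 53.5\% - 43.5\% = 10.0\%, \\
\text{CIFAR-10, clean:} \quad & 89.0\% - 81.7\% = 7.3\%.
\end{align*}
All three differences use the adaptive-attack column (\emph{Ours}) rather than the standard setting (\emph{Ours*}).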

\subsection{Robustness to non-$\ell_p$ attacks}
We demonstrate how \ours{} can be extended to perturbation types beyond $\ell_p$ norms. \citet{laidlaw2021perceptual} evaluate the robustness of various adversarial defenses against attacks $\mathcal{A} \in \mathcal{S} = \{\ell_2, \ell_\infty, \st{}, \re{}\}$ on CIFAR-10. We directly compare \ours{} with the pre-trained models for each individual defense provided in their work. This includes their defense based on perceptual adversarial training (\textbf{PAT}) and the \textbf{MAX}, \textbf{AVG} models, along with perturbation-specific robust models \textbf{$\mp$}. Specifically, as discussed in Section~\ref{subsec:dataset-creation}, we train a perturbation classifier that classifies adversarial examples as belonging to one of two classes: $\{\{\ell_\infty,\ell_2,\re{}\},\st{}\}$. We use two individual robust predictors: $\{M_{\ell_\infty}, M_\st{}\}$. The choice is once again made based on the robust accuracy of the $M_{\ell_\infty}$ model against $\{\ell_\infty,\ell_2,\re{}\}$ attacks, as also presented in Table~\ref{tab:PAT}. This ability to combine attacks also reflects positively on the scalability of \ours{}. \ours{} improves over the strongest baseline (PAT) by $13.1\%$ against the union of all attacks ($40.9\%$ vs.\ $27.8\%$). Importantly, \ours{} preserves a high accuracy against benign samples, whereas PAT classifies only $71.6\%$ of unperturbed samples correctly,
which makes it difficult to adopt in real-world settings.




\section{Conclusion}
In this work, we introduce the problem of categorizing perturbation types. 
We theoretically demonstrate that adversarial inputs of different attack types are separable, and empirically validate our claims on different $\ell_p$ and non-$\ell_p$ attacks. 
In addition to categorizing them with high accuracy,  the perturbation categorizer also generalizes to \emph{unseen} corruptions of the same category.

{\ours} performs perturbation type categorization to achieve robustness against the union of multiple perturbation types.
We theoretically examine the natural tension that any adversary trying to fool our model faces between fooling the attack classifier and fooling the specialized robust predictors. Our empirical results on MNIST and CIFAR-10 datasets complement our theoretical analysis, showing that {\ours} outperforms existing defenses against multiple $\ell_p$ and non-$\ell_p$ attacks by over $5\%$, while showing gains of around $15\%$ on average and clean accuracy metrics.


Our work serves as a stepping stone towards the goal of universal adversarial robustness, by dissecting multiple adversarial objectives into individually solvable pieces and combining them via \ours{}.
In its present form, \ours{} requires the knowledge of each individual attack type that we want to be robust against---to train the perturbation classifier. 
This limitation opens up various avenues for future work, including the new problem of perturbation categorization by defining sub-classes of adversarial attack types, and training generative models to~synthesize~diverse~perturbations. 



\begin{acknowledgements} %
This material is in part based upon work supported by the National 
Science Foundation under Grant No. TWC-1409915, Berkeley DeepDrive, and DARPA D3M under Grant No. FA8750-17-2-0091.
Any opinions, findings, and conclusions or recommendations expressed
in this material are those of the author(s) and do not necessarily
reflect the views of the National Science Foundation. Xinyun Chen is supported by the Facebook Fellowship.
\end{acknowledgements}

\bibliography{paper}

\end{document}
