%\documentclass{uai2024} % for initial submission
\documentclass[accepted]{uai2024} % after acceptance, for a revised version;
% also before submission to see how the non-anonymous paper would look like

%% There is a class option to choose the math font
% \documentclass[mathfont=ptmx]{uai2024} % ptmx math instead of Computer
                                         % Modern (has noticeable issues)
% \documentclass[mathfont=newtx]{uai2024} % newtx fonts (improves upon
                                          % ptmx; less tested, no support)
% NOTE: Only keep *one* line above as appropriate, as it will be replaced
%       automatically for papers to be published. Do not make any other
%       change above this note for an accepted version.

%% Choose your variant of English; be consistent
\usepackage[american]{babel}
% \usepackage[british]{babel}

%% Some suggested packages, as needed:
\usepackage{amsmath,amsfonts,bm}
\usepackage{url}
\usepackage{color}
\hypersetup{
colorlinks=true,citecolor=blue
}
%\usepackage{cite}
%\usepackage[sort&compress, numbers]{natbib}
%\usepackage[authoryear,square]{natbib}
\usepackage{natbib}
\setcitestyle{authoryear,round}
\usepackage{adjustbox}
\usepackage{caption}
\usepackage{subcaption}
\usepackage{algorithm}
\usepackage{bbm}
\usepackage{multirow}
%% Some suggested packages, as needed:
\usepackage{natbib} % has a nice set of citation styles and commands
    \bibliographystyle{plainnat}
    \renewcommand{\bibsection}{\subsubsection*{References}}
\usepackage{mathtools} % amsmath with fixes and additions
% \usepackage{siunitx} % for proper typesetting of numbers and units
\usepackage{booktabs} % commands to create good-looking tables
\usepackage{tikz} % nice language for creating drawings and diagrams

%% Provided macros
% \smaller: Because the class footnote size is essentially LaTeX's \small,
%           redefining \footnotesize, we provide the original \footnotesize
%           using this macro.
%           (Use only sparingly, e.g., in drawings, as it is quite small.)

%% Self-defined macros
\newcommand{\swap}[3][-]{#3#1#2} % just an example

\title{Neural Architecture Search Finds Robust Models by Knowledge Distillation}

% The standard author block has changed for UAI 2024 to provide
% more space for long author lists and allow for complex affiliations
%
% All author information is automatically removed by the class for the
% anonymous submission version of your paper, so you can already add your
% information below.
%
% Add authors
\author[1]{Utkarsh Nath}
\author[1]{Yancheng Wang}
\author[1]{Yingzhen Yang}
% Add affiliations after the authors
\affil[1]{%
School of Computing and Augmented Intelligence\\
Arizona State University, Tempe, AZ 85281, USA \\
\texttt{\{unath,ywan1053,yingzhen.yang\}@asu.edu}
}
\input{self_defined_notations}
  \begin{document}
\maketitle

\begin{abstract}
% Original Abstract
% Deep Neural Networks are often vulnerable to adversarial attacks. Neural Architecture Search (NAS), one of the tools for developing novel deep neural architectures, demonstrates superior performance in prediction accuracy in various machine learning applications. However, the performance of a model architecture learnt via NAS against adversarial attacks has not been sufficiently studied. Given the presence of a robust teacher, we investigate if NAS would produce a robust neural architecture by inheriting robustness from the teacher. In this paper, we propose Robust Neural Architecture Search by Cross-Layer Knowledge Distillation (RNAS-CL), a novel NAS algorithm that improves the robustness of NAS by learning from a robust teacher through cross-layer knowledge distillation. Unlike previous knowledge distillation methods that encourage close student-teacher output only in the last layer, RNAS-CL automatically searches for the best teacher layer to supervise each student layer. Experimental results demonstrate the effectiveness of RNAS-CL, and show that RNAS-CL produces compact and adversarially robust neural architectures. Our results point to new approaches for compact and robust neural architecture development for many applications. The code of RNAS-CL is available at \url{https://anonymous.4open.science/r/RNAS-CL-A2EE/}.

Despite their superior performance, Deep Neural Networks (DNNs) are often vulnerable to adversarial attacks. Neural Architecture Search (NAS), a method for automatically designing the architectures of DNNs, has shown remarkable performance across various machine learning applications. However, the adversarial robustness of architectures learned by NAS remains underexplored. By integrating a robust teacher, we examine whether NAS can yield a robust neural architecture by inheriting robustness from the teacher. In this paper, we propose Robust Neural Architecture Search by Cross-Layer Knowledge Distillation (RNAS-CL), a novel NAS algorithm that enhances the robustness of architectures learned by NAS through cross-layer knowledge distillation from a robust teacher. Distinct from previous knowledge distillation approaches that only align student-teacher outputs at the final layer, RNAS-CL dynamically searches for the optimal teacher layer to guide each student layer. Our experimental findings validate the effectiveness of RNAS-CL, demonstrating that it can generate both compact and adversarially robust neural architectures. Our results pave the way for developing new strategies for compact and robust neural architecture design applicable across various fields. The code of RNAS-CL is available at \url{https://github.com/Statistical-Deep-Learning/RNAS-CL}.
\end{abstract}
\section{Introduction}
\label{intro}
Neural Architecture Search (NAS) has emerged as a vital tool for fostering advancements in deep neural networks, enhancing state-of-the-art (SOTA) performance across various fields, such as computer vision and natural language processing. NAS methods automate the search for neural architectures based on predefined criteria, eliminating the need for labor-intensive and time-consuming manual architecture design. Early works on NAS utilized Evolutionary Algorithms (EA) \citep{real2017largescale} and Reinforcement Learning (RL) \citep{zoph2017neural, tan2019mnasnet}. Despite their effectiveness, these methods require substantial computational resources. For example, some of these approaches require up to thousands of GPU days to reach SOTA performance for the image classification task on the ImageNet dataset. To overcome these challenges, recent works \citep{liu2018darts, cai2019proxylessnas, wu2019fbnet, wan2020fbnetv2, nath2020adjoined} represent architectures with a shared-weight supernet and refine the weights through gradient descent. The architectures identified by the architecture parameters in the supernet through NAS deliver two key benefits. First, they are optimized for both speed and size, enhancing their practical utility. Second, the searched architectures set new SOTA performance for a variety of computer vision tasks. Both advantages make NAS incredibly useful for real-world applications. Nonetheless, most NAS methods focus primarily on optimizing accuracy, parameter count, or FLOPs, and the performance of searched architectures under adversarial attacks, which is crucial for deploying secure and resilient machine learning systems, remains underexplored. Only a few studies \citep{yue2022effective, ning2020discovering, xie2021tiny} have examined NAS with the aim of enhancing both adversarial robustness and efficiency.
In this paper, we introduce RNAS-CL, a NAS methodology that concurrently optimizes for accuracy, latency, and defense against adversarial attacks, without the need for robust training.


Adversarial examples are created by altering the inputs, typically by introducing small, carefully crafted perturbations into a clean image, causing the model to misclassify it. It is well-known that almost all deep neural networks are vulnerable to these adversarial attacks \citep{szegedy2013intriguing}. Consequently, assessing the resilience of models to adversarial attacks is of paramount importance. Models that can withstand adversarial attacks are essential for critical applications such as autonomous driving, healthcare, and physical security systems.

\begin{figure}[t]
        \begin{center}
            \includegraphics[ trim=0 0 0 0, height=130pt] {Images/comparison_plot.eps}
        \end{center}
        \caption{The figure compares various SOTA efficient and robust methods on CIFAR-10. Clean Accuracy represents top-1 accuracy on clean images. Adversarial Accuracy represents top-1 accuracy on images perturbed by PGD attack. Larger marker size indicates larger architecture. The numbers in brackets represent the number of parameters and MACs, respectively.}
        \label{fig:comparison_plot}
\end{figure}

Adversarial training is a well-established strategy to enhance the defense mechanisms of models against adversarial attacks \citep{goodfellow2014explaining, madry2017towards, kannan2018adversarial, tramer2017ensemble, zhang2019theoretically}. Approaches in this category usually train the models on adversarial examples, typically generated using techniques such as the fast gradient sign method (FGSM) \citep{goodfellow2014explaining} or projected gradient descent (PGD) \citep{madry2017towards}. Other defense strategies include training models with specialized loss functions or regularization \citep{cisse2017parseval, hein2017formal, yan2018deep, pang2019rethinking}, preprocessing inputs prior to model input \citep{dziugaite2016study, guo2017countering, xie2019improving}, and employing model ensembles \citep{kurakin2018adversarial, liu2018towards}.

Recent studies have also highlighted the role of network architecture in influencing adversarial robustness \citep{madry2017towards, guo2020meets, su2018robustness, xie2019intriguing, huang2021exploring}. Inspired by these insights, we introduce Robust Neural Architecture Search by Cross-Layer Knowledge Distillation (RNAS-CL). To the best of our knowledge, our work is among the first methods that employ knowledge distilled from a robust teacher model to discover a robust architecture. Knowledge distillation traditionally involves transferring knowledge from a complex teacher model to a simpler student model using the teacher's outputs as ``soft labels'' \citep{hinton2015distilling}. However, beyond the outputs, the teacher's intermediate layers offer valuable attention information, where each layer focuses on different aspects of the input \citep{zagoruyko2016paying}.

The central question of our investigation is: \textit{can a robust teacher improve the robustness of the student model by providing information about where to look, i.e., where to pay attention?}
The proposed RNAS-CL method confirms this, enabling the student model to learn not only from the teacher's outputs but also ``where to look'' from the teacher's layers. Given the disparity in the number of layers between the teacher and student models, it is crucial for the student to identify the most beneficial teacher layer to learn from. The RNAS-CL method therefore also searches for the ideal teacher layer for each student layer.

Furthermore, inspired by recent progress in self-supervised and semi-supervised learning that emphasizes consistency between predictions from various augmented views, we propose a novel Confidence-Aware Adversarial Consistency (CAC) loss. The CAC loss aims to maximize prediction consistency between adversarial and original views of inputs. CAC is compatible with various adversarial training methodologies, such as TRADES. The experimental results demonstrate that RNAS-CL significantly surpasses most existing models without adversarial training in robust accuracy on the CIFAR-10 dataset. Moreover, applying CAC and TRADES to adversarially train RNAS-CL models significantly enhances their robustness. The effectiveness of RNAS-CL extends to promising results on the large-scale ImageNet dataset as well.

\vspace{-2mm}
\subsection{Contributions}

Our contributions are detailed as follows.

First, we propose RNAS-CL -- a novel method for searching neural architectures that optimize the trade-off between robustness and prediction accuracy in a differentiable way. To the best of our knowledge, RNAS-CL is the first NAS approach that simultaneously optimizes for robustness and prediction accuracy without the necessity of robust training. By incorporating a penalty on model size and inference cost, the architectures derived through RNAS-CL are more compact than those from other NAS methods. We compare RNAS-CL against other models known for their computational efficiency and robustness \citep{sehwag2020hydra, ye2019adversarial, gui2019model, goldblum2020adversarially, dong2020adversarially, huang2021exploring}. RNAS-CL models of comparable size demonstrate superior performance in both clean and PGD accuracy on the CIFAR-10 dataset.



Second, we extend the field of Knowledge Distillation (KD) within the framework of NAS. Unlike traditional KD, which relies on fixed connections between the teacher and student models, RNAS-CL innovates by introducing learnable connections between layers of the teacher and the student models. This advancement not only enhances the efficacy of KD but also provides insights into the development of future adversarially robust NAS methods.


\section{Related Work}
\subsection {Knowledge Distillation}
Knowledge Distillation (KD) involves transferring knowledge from a larger, more complex model to a smaller, more manageable one. \citet{hinton2015distilling} introduced the concept of the teacher-student model, utilizing soft targets from the teacher to train the student model. This approach encourages the student to generalize in a manner similar to the teacher. Since this foundational work, various KD variants have been developed \citep{romero2014fitnets, yim2017gift, zagoruyko2016paying, li2019layer, tian2019contrastive, sun2019deeply}, incorporating feature maps, attention maps, or contrastive learning elements. FitNets \citep{romero2014fitnets} pioneered the use of intermediate-level hints from the teacher model to enhance student model training. This method involves a two-stage training process where the student first learns to predict the output of a middle (hint) layer of the teacher, followed by fine-tuning with the standard KD optimization function. The introduction of intermediate hints allowed the student model to achieve improved performance with fewer parameters. Moving a step further, \citet{yim2017gift}, \citet{zagoruyko2016paying}, and \citet{li2019layer} utilize information from multiple teacher layers to guide the student's training. \citet{yim2017gift} utilized a Gramian matrix between the outputs of the first and last layers to chart the problem-solving process, transferring knowledge by minimizing the distance between the student's and teacher's flow matrices. \citet{li2019layer} calculated inter-layer and inter-class Gramian matrices to identify the most representative layers, minimizing the distance between these key layers of both student and teacher. \citet{zagoruyko2016paying} focused on minimizing the distance between the attention maps of the teacher and student at various blocks.
In contrast with the above methods, RNAS-CL aims to map each student layer to a corresponding teacher layer, optimizing the match for each pair. This method extends the concept of attention map alignment, similar to that of \citet{zagoruyko2016paying}, by minimizing the distance between the attention maps of matched student-teacher layers. This comprehensive mapping ensures a more detailed and effective knowledge transfer throughout the student's architecture.

\subsection{Neural Architecture Search}
Neural Architecture Search (NAS) is a method that automates the design of neural networks without human intervention. Traditionally, finding the optimal architecture within a given search space involves training each potential architecture from scratch until convergence. This approach, while straightforward, is computationally prohibitive. Early NAS efforts employed Reinforcement Learning (RL) \citep{zoph2017neural, tan2019mnasnet} and Evolutionary Algorithms (EA) \citep{real2017largescale}, but these methods also demanded significant computational resources.
More recent advancements \citep{liu2018darts, cai2019proxylessnas, wu2019fbnet} have introduced the concept of a weight-sharing super-network, which encompasses all candidate architectures. This network is over-parameterized and includes distinct paths for each architecture, each path having its own set of weights. These weights are then optimized through gradient descent during training to eventually select a single, optimal architecture. This selected network is subsequently trained in a conventional manner. While these methods have achieved state-of-the-art (SOTA) results on various classification tasks, their vulnerability to adversarial attacks remains largely unexplored.
Research \citep{devaguptapu2021adversarial, guo2020meets, li2021neural, madry2017towards, su2018robustness, xie2019intriguing, huang2021exploring} has shown that network architecture significantly influences adversarial robustness. Studies such as \citet{devaguptapu2021adversarial} have noted that handcrafted architectures tend to be more resilient against adversarial attacks than NAS-generated models. Moreover, it has been empirically observed that larger models generally exhibit greater robustness against such attacks. \citet{guo2020meets} found that architectures with dense connections are particularly resistant to adversarial threats, prompting them to devise a NAS strategy that includes adversarial training on a supernet followed by the selection of densely connected architectures.
\citet{li2021neural} expanded the backbone network to maintain accuracy while optimizing both the architecture and its parameters through adversarial training. Although this approach shows promising results, the main downside is that adversarial training is time-intensive and tends to degrade performance on standard (clean) images. Our proposed RNAS-CL method stands out by optimizing for both robustness and prediction accuracy without the need for adversarial training.


\subsection{Efficient and Robust Models}
%  The deep learning research community has extensively studied building efficient and adversarially robust models individually. However, few works combine both domains, that is, building an efficient model which is also adversarially robust.
% \citep{sehwag2020hydra} proposed to make the pruning technique aware of the robust training objective. They formulate pruning as an empirical risk minimization (ERM) problem and integrate it with a robust training objective. \citep{huang2021exploring} investigated the impact of network width and depth configurations on the robustness of adversarial-trained DNNs. They observed that reducing capacity at last blocks improves adversarial robustness. \citep{goldblum2020adversarially}, proposed Adversarially Robust Distillation (ARD), where they encouraged student networks to mimic their teacher's output within an $\epsilon$-ball of training samples. Furthermore, there are few NAS methods \citep{yue2022effective, ning2020discovering, xie2021tiny} that jointly optimize accuracy, latency, and robustness. \citep{ning2020discovering} trained a multi-shot NAS method to search for adversarially robust architectures.
% They interpolate multiple one-shot methods to find architecture at the targeted capacity. \citep{xie2021tiny, yue2022effective} proposed a one-shot NAS method that selects an efficient model from the adversarially trained supernet. Compared to these methods, similar-sized RNAS-CL models achieve higher accuracy for both clean and and advarsarial images.

The deep learning research community has thoroughly investigated the creation of efficient models and adversarially robust models as separate endeavors. However, integrating these two domains, that is, developing models that are both efficient and adversarially robust, has seen limited exploration. \citet{sehwag2020hydra} introduced an approach to make pruning techniques sensitive to robust training objectives. They framed pruning as an empirical risk minimization (ERM) problem and combined it with a robust training framework. \citet{huang2021exploring} examined how the configurations of network width and depth affect the robustness of adversarially trained deep neural networks (DNNs). They found that reducing the capacity of the final blocks of a network could enhance its adversarial robustness.
\citet{goldblum2020adversarially} developed Adversarially Robust Distillation (ARD), a method that prompts student networks to approximate their teacher's output within an $\epsilon$-ball of training samples, fostering robustness in the student models.
Additionally, a few Neural Architecture Search (NAS) methods \citep{yue2022effective, ning2020discovering, xie2021tiny} have aimed to optimize for accuracy, latency, and robustness concurrently. \citet{ning2020discovering} implemented a multi-shot NAS approach to identify architectures that are robust against adversarial attacks, blending multiple one-shot methods to target specific capacities. \citet{xie2021tiny} and \citet{yue2022effective} employed a one-shot NAS technique that selects an efficient model from an adversarially trained supernet. In comparison to these methods, models developed using RNAS-CL achieve superior accuracy on both clean and adversarial images while maintaining comparable size, thus demonstrating the effectiveness of integrating robustness and efficiency in neural architecture design.

\begin{figure*}[!htbp]
                \centering
                \includegraphics[width=0.8\linewidth] {Images/RNAS-CL-v2.eps}
                % \large

                \caption{(a) Training paradigm based on RNAS-CL. We connect attention maps from each student layer to each robust teacher layer. For each student layer, we search for the optimum teacher layer. $g_{ij}$ represents the Gumbel weight associated with the $i^{th}$ student layer and the $j^{th}$ teacher layer. RNAS-CL induces robustness in the student model by searching for the optimum teacher layer. Inspired by FBNetV2 \citep{wan2020fbnetv2}, we also search for the number of filters in each layer to build an efficient model. (b) Sample attention maps corresponding to the input image (i) from low-level (ii), mid-level (iii), and high-level (iv) convolution layers.
  }

\label{fig:rkd_nas}
\end{figure*}


\section{Robust Knowledge Distillation for Neural Architecture Search}

We utilize knowledge distilled from a robust teacher model to facilitate the search for an architecture that achieves both robustness and efficiency. Knowledge distillation involves transferring knowledge from a larger teacher model to a smaller student model. In standard knowledge distillation, the teacher model's outputs serve as ``soft labels'' for training the student model. However, valuable attention information is also contained in the intermediate features of the teacher, where different layers concentrate on distinct parts of the input object.
In RNAS-CL, the student model not only benefits from the teacher's soft labels but also learns where to direct its attention from the teacher's intermediate layers. Each student layer is specifically aligned with a robust teacher layer to learn targeted areas of focus. In Section \ref{subsection:intermediate_connections}, we discuss how we define attention maps. We hypothesize that learning directed attention from a robust teacher inherently enhances the student model's resistance to adversarial attacks. RNAS-CL is designed to identify the optimal tutor layer for each student layer while concurrently searching for an efficient architecture. In Sections \ref{subsection:tutor_search} and \ref{subsection:architecture_search}, we discuss our tutor and architecture search algorithms. Similar to other state-of-the-art NAS methods \citep{liu2018darts, wu2019fbnet, wan2020fbnetv2}, RNAS-CL is structured around a search phase and a training phase. In the search phase, we optimize the neural architecture weights. In the training phase, the architecture selected in the search phase is trained using conventional methods.
In Section \ref{subsection:RKD_Loss}, the objectives for the search and training phases are elaborated. Although RNAS-CL can identify robust neural architectures for the student model, we aim to further enhance robustness through adversarial training.  In Section \ref{subsection:adversarial}, we introduce a novel regularization term, Confidence-Aware Adversarial Consistency Loss (CAC), which can be integrated with any adversarial training objective, such as TRADES and FastAT \citep{wong2020fast}, to increase the robustness of the model.


\subsection{Attention Map}
\label{subsection:intermediate_connections}

We focus on learning where to pay attention from a robust teacher model, specifically analyzing convolution layers with activation tensors represented as $A \in \RR^{C \times H \times W}$, where $C$ is the number of channels, and $H$ and $W$ represent the spatial dimensions. A mapping function $\mathcal{F}: \RR^{C \times H \times W} \longrightarrow \RR^{H \times W}$ is defined to convert the tensor $A$ into an attention map $\mathcal{F}(A) \in \RR^{H \times W}$ by $\left[\mathcal{F} (A)\right]_{hw} = {\sum_{c=1}^{C}} A_{c,h,w}^2$, where $A_{c,h,w}$ represents the element of $A$ with channel coordinate $c$ and spatial coordinates $h$ and $w$. This activation-based mapping function $\mathcal{F}$, which was introduced by \citet{zagoruyko2016paying}, is applied to the activation tensor after each convolution layer to generate an attention map. Several attention maps are illustrated in Figure \ref{fig:rkd_nas}(b). RNAS-CL aims to match each student layer with a corresponding teacher layer, termed a tutor, ensuring that the student's attention map closely resembles that of its designated tutor from the teacher model. Given that the dimensions of the student's attention map might differ from those of its tutor, we interpolate all attention maps to a standardized dimension to facilitate accurate comparisons and alignments.
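
As a small illustration of the mapping $\mathcal{F}$, using hypothetical values that are not taken from our experiments, let $C = 2$, $H = 1$, and $W = 2$, with $A_{1,1,:} = (1, 2)$ and $A_{2,1,:} = (2, 0)$. Summing the squared activations over the channel dimension and then applying the $\ell_2$ normalization used when comparing attention maps gives
\begin{align*}
\mathcal{F}(A) &= \left(1^2 + 2^2,\; 2^2 + 0^2\right) = (5, 4), \\
\frac{\mathcal{F}(A)}{\lVert \mathcal{F}(A) \rVert_2} &= \frac{(5, 4)}{\sqrt{41}} \approx (0.781,\, 0.625),
\end{align*}
so the first spatial location, where the channel-wise activation energy is larger, receives more attention.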

\subsection{Tutor Search}
\label{subsection:tutor_search}
We aim to identify an appropriate tutor (teacher layer) for each student layer, which instructs the student layer on where to pay attention. The potential combinations create a vast search space: each student layer has the option to select any of the teacher layers as its tutor, so the search space grows exponentially with the number of student layers. For instance, the search space of a student model with $20$ layers and a teacher model with $50$ layers is of size $50^{20}$. To reduce the computation cost of the search process, we adopt Gumbel-Softmax \citep{jang2016categorical} to search for the tutor for each student layer in a differentiable manner.
Given network parameters $v = [v_1, \ldots, v_n]$ and a temperature constant $\tau > 0$, the Gumbel-Softmax function is defined as $g(v) = [g_1, \ldots, g_n]$, where
$g_i = \frac{\exp[(v_i+ \epsilon_i)/\tau] }{\sum_{k=1}^{n}{\exp[(v_k + \epsilon_k)/\tau]}}$ and $\epsilon_i \sim \mathrm{Gumbel}(0, 1)$ is random noise, also referred to as Gumbel noise, which can be sampled as $\epsilon_i = -\log(-\log u_i)$ with $u_i \sim \mathrm{Uniform}(0, 1)$.
When $\tau \rightarrow 0$, Gumbel-Softmax tends to the $\argmax$ function. Gumbel-Softmax is a ``re-parametrization trick'' that can be regarded as a differentiable approximation to the $\argmax$ function.
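
To illustrate the effect of the temperature, consider a hypothetical two-way choice with $v = (1, 0)$ and, purely for simplicity, the noise terms set to $\epsilon_1 = \epsilon_2 = 0$. In this case
\begin{equation*}
g_1 = \frac{\exp(1/\tau)}{\exp(1/\tau) + 1} = \frac{1}{1 + \exp(-1/\tau)},
\end{equation*}
which gives $g_1 \approx 0.731$ at $\tau = 1$ and $g_1 \approx 0.99995$ at $\tau = 0.1$. Lowering the temperature thus pushes $g(v)$ toward a one-hot vector while keeping it differentiable with respect to $v$. In practice the noise terms are nonzero, so these numbers are only indicative.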

Consider a teacher model $T$ and a student model $S$, consisting of $n_t$ and $n_s$ layers, respectively. Let $A_t^i$ and $A_s^i$ represent the activation tensors of the $i^{th}$ layer in the teacher and student models, respectively. In the RNAS-CL framework, the $i^{th}$ student layer is paired with a vector of $n_t$ Gumbel weights $g_i \in \RR^{1 \times n_t}$. Denote by $g_{ij}$ the Gumbel weight linking the $i^{th}$ student layer to the $j^{th}$ teacher layer. The attention loss is then defined as follows:
\begin{multline}
    L_{\textup{Attn}}(A_t, A_s) = \\ \frac{1}{n_s \times n_t}
    {\sum_{i=1}^{n_s}}{\sum_{j=1}^{n_t}} g_{ij}
    \bigg\lVert \frac{\mathcal{F}{(A_{s}^{i})}}{\lVert\mathcal{F}{(A_s^i)}\rVert_2}
    - \frac{\mathcal{F}{(A_{t}^{j})}}{\lVert\mathcal{F}{(A_t^j)}\rVert_2}
    \bigg\rVert_2,
        \label{eqn:attention_loss}
\end{multline}
\noindent
where $A_s$ and $A_t$ represent the activation tensors for all convolution layers in the student and teacher models, respectively, $\mathcal{F}$ is the mapping function defined in Section~\ref{subsection:intermediate_connections}, and $\|\cdot\|_2$ is the $\ell_2$-norm. Throughout the search process, we apply an exponential decay to the temperature $\tau$ of the Gumbel-Softmax, so that each $g_i$ gradually approaches a one-hot vector.
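
To give a sense of scale for the example above with $n_s = 20$ and $n_t = 50$, the relaxation replaces a discrete search over
\begin{equation*}
n_t^{\,n_s} = 50^{20} = 5^{20} \times 10^{20} \approx 9.5 \times 10^{33}
\end{equation*}
possible student-teacher assignments with the optimization of $n_s \times n_t = 1000$ Gumbel weights, one per student-teacher layer pair. This count assumes a single learnable parameter behind each Gumbel weight.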

\subsection{Architecture Search}
\label{subsection:architecture_search}
In addition to identifying the optimal tutor for each layer, we aim to develop an architecture that prioritizes efficiency and low latency. Drawing inspiration from FBNetV2 \citep{wan2020fbnetv2}, our search focuses on determining the ideal number of filters, or output channels, for each convolution block. Consider a set of filter options $A=\{f_1, f_2, \ldots, f_n\}$ and their corresponding outputs $\{z_1, z_2, \ldots, z_n\}$ for a convolution block. The cumulative output is then defined as $Z = {\sum_{i=1}^{n}} g_w^{(i)} z_i$, where $g_w^{(i)}$ represents the Gumbel weight associated with the $i^{th}$ filter choice. We minimize the number of FLOPs as a proxy for latency, noting that FLOPs are directly proportional to the number of filters. The cumulative filter count, weighted by the Gumbel weights, is differentiable and can therefore be optimized using SGD. Similar to the tutor search, the temperature is decayed exponentially so that the encoding approaches a one-hot vector. Figure~\ref{fig:fbnet_v2} in the appendix illustrates the architecture search process of FBNetV2.
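
One way to make the cumulative filter count concrete, assuming it is formed as the Gumbel-weighted sum of the candidate filter counts (the symbol $\hat{f}$ and the numbers below are introduced only for illustration), is
\begin{equation*}
\hat{f} = \sum_{i=1}^{n} g_w^{(i)} f_i .
\end{equation*}
For hypothetical options $(f_1, f_2, f_3) = (8, 16, 32)$ with Gumbel weights $(0.1, 0.2, 0.7)$, this gives $\hat{f} = 0.8 + 3.2 + 22.4 = 26.4$. Because $\hat{f}$ varies smoothly with the Gumbel weights, a FLOPs penalty proportional to $\hat{f}$ can be reduced by gradient descent, which shifts weight toward smaller filter counts whenever accuracy permits.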

\subsection{RNAS-CL Loss}
\label{subsection:RKD_Loss}
Adhering to the practices of leading NAS methodologies \citep{liu2018darts,wu2019fbnet}, RNAS-CL incorporates distinct searching and training phases. During the search phase, the Gumbel weights and other model parameters are updated each epoch. These include the Gumbel weights $\left\{g_{ij}\right\}$ associated with the student-teacher connections in (\ref{eqn:attention_loss}), and the Gumbel weights $\left\{g_w^{(i)}\right\}$ for selecting filters as outlined in Section~\ref{subsection:architecture_search}. These weights are optimized using the RNAS-CL search loss, which is detailed next.

\textbf{RNAS-CL search loss.} Let $y$ be the ground-truth one-hot encoded vector, $p$ and $q$ be the output probabilities of the student and teacher networks, and $A_s, A_t$ be the activation tensors for all student and teacher convolution layers. The RNAS-CL search loss is given by
\begin{multline}
L(y, p, q, A_t, A_s) = (-y\log p  + \thinspace \KL(p, q) \\ + \gamma_s L_{\textup{Attn}}(A_t, A_s)) n_f,
\label{eqn:rkd_search_loss}
\end{multline}
where $\KL(p, q) = \sum_i p_i \log \frac{p_i}{q_i}$ denotes the Kullback-Leibler (KL) divergence between the probability distributions, $\Lattn$ is the attention loss defined in (\ref{eqn:attention_loss}), and $\gamma_s$ is a normalization constant. $n_f$ represents the latency, which is minimized through differentiable optimization as in \citep{wan2020fbnetv2}.

Upon completion of the search phase, a tutor layer $j^*$ is chosen for each student layer $i$, where $j^* = \argmax_j g_{ij}$. Additionally, the optimal filter choices for each convolution block, as discussed in Section~\ref{subsection:architecture_search}, are determined based on the highest Gumbel weights. After the search, the training phase begins, in which the searched architecture is trained using the RNAS-CL training loss described next.

\textbf{RNAS-CL train loss.} Let $y$ be the ground-truth one-hot encoded vector, let $p$ and $q$ be the output probabilities of the student and teacher networks, and let $A_s$ and $A_t$ be the activation tensors of all student and teacher convolution layers, respectively. The training loss of RNAS-CL is
\begin{multline}
L(y, p, q, A_t, A_s) = L_{\textup{CE}}(y,p) + \thinspace \KL(p, q) \\ + \gamma_t L_{\textup{Attn}}(A_t, A_s),
\label{eqn:rkd_train_loss}
\end{multline}
where $L_{\textup{CE}}(y,p) = -y\log p$ represents the cross-entropy loss, $\KL(p, q)$ denotes the KL-divergence, and $\gamma_t$ is a normalization constant. It should be noted that $g_i$ within $\Lattn$ is defined as a one-hot vector, leading to the optimization of each student attention map with respect to a specific tutor layer.

\subsection{Confidence-Aware Adversarial Consistency Loss}
\label{subsection:adversarial}

Motivated by studies in self-supervised learning~\citep{zhai2019s4l} and semi-supervised learning~\citep{berthelot2019mixmatch} that enforce consistency between predictions from different augmented views, we introduce a consistency loss that increases the agreement between the predictions for the adversarial and original views of the input data. This loss is applied only to samples for which the prediction averaged over the two views is highly confident. For an input image $x$, we first generate its adversarial counterpart $x_{adv}$ and obtain the predictions of the student network for $x$ and $x_{adv}$, denoted as $p$ and $p_{adv}$, respectively. We then compute the mean of these predictions as $\Bar{p} = \frac{p+p_{adv}}{2}$. This average prediction $\Bar{p}$ is sharpened as $\Tilde{p}_j = {\Bar{p}_j}^{\frac{1}{\tau}} / \sum_{k=1}^K {\Bar{p}_k}^{\frac{1}{\tau}}$, where $K$ is the total number of classes and $\Tilde{p}_j$ is the $j$-th component of $\Tilde{p}$. Here, $\tau \in \left(0,1\right]$ is the sharpening parameter. As $\tau$ decreases, $\Tilde{p}$ approaches a one-hot distribution. The sharpened $\Tilde{p}$ is then used as a pseudo label for $x$, reflecting the predictions for both $x$ and $x_{adv}$. Our objective is to enforce consistency between $p$ and $p_{adv}$ by minimizing their divergence from $\Tilde{p}$, which defines our confidence-aware adversarial consistency loss as
\begin{equation}
\small
    \Lcac(x) = \mathbbm{1}(\max(\Bar{p})\geq \gamma)
    \left( \KL(\Tilde{p}, p)+\KL(\Tilde{p}, p_{adv})\right),
\end{equation}
where $\mathbbm{1}(\cdot)$ represents the indicator function, and $\gamma \in \left[0,1\right)$ denotes the confidence threshold. In the context of $\Lcac$, consistency optimization between the predictions of an image and its adversarial view is conducted only if the maximum value in the prediction vector $\Bar{p}$ meets or exceeds $\gamma$. This condition ensures that $\Lcac$ enforces consistency solely on images where the predictions are deemed confident. The optimization of $\Lcac$ is designed to mitigate the detrimental effects of the noisy adversarial view, thereby enhancing the robustness of the student network. Our model undergoes adversarial training using $\Lcac$ alongside established adversarial objectives like TRADES and FastAT. The combined training loss for this adversarial training, incorporating both TRADES and $\Lcac$, is defined as
\begin{equation}
    \Ladv = \Lcac + \Ltrades + \Lkl + \gamma_t \Lattn,
    \label{eqn:adv_training}
\end{equation}
where $\Ltrades$ is the TRADES optimization objective and $\Lkl, \gamma_t, \Lattn$ are the same as those in (\ref{eqn:rkd_train_loss}).
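
To make the sharpening step and the confidence gate in $\Lcac$ concrete, consider a hypothetical three-class example ($K=3$) with averaged prediction $\Bar{p} = (0.6, 0.3, 0.1)$ and sharpening parameter $\tau = 0.5$; these numbers are chosen only for illustration and are not values used in the experiments. Since $1/\tau = 2$, each entry is squared and renormalized:
\begin{align*}
\Tilde{p} &= \frac{(0.36,\, 0.09,\, 0.01)}{0.36 + 0.09 + 0.01} \\
&\approx (0.783,\, 0.196,\, 0.022).
\end{align*}
The sharpened pseudo label puts more mass on the leading class than $\Bar{p}$ does. With an assumed threshold $\gamma = 0.5$, we have $\max(\Bar{p}) = 0.6 \geq \gamma$, so the indicator equals one and the sample contributes to $\Lcac$; with $\gamma = 0.7$ the same sample would be skipped.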



 \begin{figure*}[!htbp]
        \begin{center}
            \includegraphics[scale=0.5] {Images/rnas-cl-comparison-plot.eps}
        \end{center}

        \caption{Comparison of various efficient and robust methods on the CIFAR-10 dataset. Clean Accuracy represents top-1 accuracy on clean images. Adversarial Accuracy represents top-1 accuracy on images perturbed by a 20-step PGD attack.}
        \label{fig:comparison_plot}
\end{figure*}


\section{Experiments}
\label{section:experiments}
In this section, we present experiments conducted on real-world datasets to demonstrate the effectiveness of our proposed framework. This section is organized as follows. In Section~\ref{subsection:implementation_details}, we discuss the settings and the implementation details.
In Section~\ref{subsection:cifar}, RNAS-CL is compared against state-of-the-art efficient and robust models on CIFAR-10, with additional results deferred to the supplementary.


\subsection{Implementation Details}
\label{subsection:implementation_details}

In this paper, we assess the performance of RNAS-CL across three prominent public image classification benchmarks: (1) CIFAR-10, which includes $60k$ images distributed across 10 classes \citep{CIFAR10}; (2) ImageNet, a comprehensive image classification dataset \citep{russakovsky2015imagenet} with approximately 1.2M images spanning 1000 classes; and (3) ImageNet-100, a more focused subset of the ImageNet-1k dataset \citep{russakovsky2015imagenet}, featuring 100 classes and around $130k$ images \citep{tian2020contrastive}. We employ standard data augmentation techniques for each dataset, including random-resize cropping and random flipping. Initially, for each dataset, we undertake a searching step where our model is trained using the RNAS-CL search loss (\ref{eqn:rkd_search_loss}), aiming to identify optimal channel numbers and the appropriate connected teacher layers for each student layer. We explore various search spaces and utilize different robust teacher models throughout these experiments. In this paper, our model is denoted as RNAS-CL-X-T, where X indicates the search space and T denotes the robust teacher model. The search spaces are detailed in Table \ref{table:search_space_cifar} and Table \ref{table:search_space_imagenet}. We test four robust teacher models: ResNet-50, ResNet-18, WideResNet-50, and WideResNet-34, referred to as R-50, R-18, WRT-50, and WRT-34, respectively. For instance, {RNAS-CL-S3-R-18} describes a model trained within the S3 search space using a ResNet-18 as the adversarially robust teacher model.

For all three datasets, we employ the SGD optimizer. For ImageNet and ImageNet-100, the momentum and weight decay are set to $0.9$ and $4 \times 10^{-5}$, respectively. The batch size is $256$, and the learning rate starts at $0.1$ and decays to zero following a cosine schedule. After the search phase, which spans $100$ epochs, the identified architecture is trained from scratch for $200$ epochs using the RNAS-CL train loss (\ref{eqn:rkd_train_loss}). For CIFAR-10, the momentum and weight decay are $0.9$ and $2 \times 10^{-4}$, respectively, with a batch size of $128$. The model is trained for $100$ epochs in both the searching and training phases. The learning rate, initially set to $0.1$, is reduced by a factor of $10$ at the $75$-th and $90$-th epochs. In line with the FBNetV2 settings, the Gumbel-Softmax temperature ($\tau$) starts at $5.0$ and is multiplied by $e^{-0.045}$ after each epoch of the search phase.
The hyperparameters $\gamma_s$ and $\gamma_t$ are set to $1.0$ in all experiments. During the search phase, $80\%$ of the data in each batch is used to optimize the model weights, while the remaining $20\%$ is used to optimize the architecture weights, that is, the Gumbel weights discussed in Section~\ref{subsection:RKD_Loss}. For robustness assessment, we deploy five prominent attacks: FGSM \citep{goodfellow2014explaining}, MI-FGSM \citep{DongLPS0HL18}, PGD \citep{madry2017towards}, CW \citep{carlini2017towards}, and AutoAttack \citep{croce2020reliable}. Following standard practice in adversarial studies \citep{madry2017towards, ZhangYJXGJ19}, adversarial perturbations are bounded in the $\ell_\infty$ norm with a maximum perturbation of $8/255$ ($\approx 0.031$).
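
For reference, the temperature schedule above can be written in closed form. Assuming the decay is applied once after every epoch and writing $\tau_t$ for the temperature after $t$ completed epochs (the subscript notation is introduced here only for this illustration), we have
\begin{equation*}
\tau_t = 5.0 \, e^{-0.045\, t}, \quad \tau_{50} \approx 0.53, \quad \tau_{100} \approx 0.056,
\end{equation*}
so that by the end of a $100$-epoch search the Gumbel-Softmax output is close to a one-hot vector.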

%We ran our experiments on NVIDIA V100 machine using Pytorch.
% We have attached our source code with the supplementary materials.

\subsection{Experimental Results and Ablation Study}
\label{subsection:cifar}

In this section, we compare the robustness of our method with that of other state-of-the-art (SOTA) efficient and robust models.
The results are shown in Figure \ref{fig:comparison_plot}, where RNAS-CL models lie in the top right corner, indicating that they are among the models with the highest clean and adversarial accuracy. We benchmark RNAS-CL against efficient models both with and without adversarial training. All RNAS-CL models use a robust WideResNet-34 \citep{rice2020overfitting} as the teacher model. Despite their smaller size, RNAS-CL models achieve substantially higher adversarial accuracy than all models trained without adversarial training. For instance, \textit{RNAS-CL-S7-WRT-34} achieves more than $28\%$ higher PGD accuracy than most other models of comparable size.

Next, we extend our comparison of RNAS-CL to models that have been adversarially trained. To ensure a fair assessment, after the initial training phase, our RNAS-CL models are further trained with our adversarial training loss (\ref{eqn:adv_training}) for an additional 20 epochs. This phase of adversarial training boosts the adversarial accuracy of RNAS-CL models, allowing them to match or exceed the adversarial accuracy of other robustly trained models. For instance, the RNAS-CL-M-WRT-34 model matches, and in some cases surpasses, the adversarial accuracy of most other methods, while being more compact and achieving significantly higher clean accuracy. Moreover, RNAS-CL enables the creation of notably smaller models. The Tiny RNAS-CL model, specifically RNAS-CL-S5-WRT-34, outperforms its counterpart, Hydra ResNet 34 \citep{sehwag2020hydra}, by approximately $12\%$ in clean accuracy while maintaining the same model size.

%Detailed discussions on the outcomes for RNAS-CL models trained with various robust teachers can be found in Section \ref{subsection:more_cifar10}.

\textbf{Comparison against various perturbation budgets.}
To further highlight the efficacy of RNAS-CL, we contrast it with previously proposed defense mechanisms across a range of perturbation budgets. In Figure \ref{fig:ablation_study} of the supplementary, we illustrate the performance of various methods under PGD and FGSM attacks. For both types of attacks, RNAS-CL consistently outperforms its counterparts at every level of perturbation. Notably, as the size of the perturbation increases, the superiority of RNAS-CL becomes even more pronounced. Specifically, at a perturbation level of $\epsilon=0.1$, RNAS-CL surpasses other methods by approximately $20\%$ in terms of resistance to both PGD and FGSM attacks. This robust performance underscores the strength of RNAS-CL in maintaining higher adversarial accuracy under increasingly challenging conditions.

We present additional experimental results and an ablation study in Section~\ref{sec:more-results-supp} of the supplementary. In Section~\ref{subsection:comparison_kd}, our methods are benchmarked against various knowledge distillation techniques as detailed in \citep{park2019relational, ahn2019variational, tung2019similarity, tian2019crd, passalis2018learning}.  Section~\ref{subsection:cw_aa} evaluates RNAS-CL and the approach by \citep{huang2021exploring} against recent attacks such as $\textup{CW}_\infty$ \citep{carlini2017towards} and AutoAttack \citep{croce2020reliable} on the CIFAR-10 dataset. In Section~\ref{subsection:ImageNet}, we compare our model with the SOTA compact and efficient method \citep{huang2021exploring}, which is known for achieving one of the best PGD accuracies on ImageNet. Section~\ref{subsection:ablation_study} provides ablation studies highlighting the significance of student-teacher cross-layer connections in RNAS-CL. We outline three training paradigms: the first uses standard cross-entropy loss without any teacher model, referred to as standard; the second minimizes the cross-entropy loss and standard KL Divergence with a robust teacher model, denoted as KL-X-T, where X represents the search space and T is the teacher model; the third model type, RNAS-CL, incorporates all three terms: cross-entropy loss, KL Divergence, and cross-layer student-teacher connections. 

Moreover, in Section~\ref{sec:robust-teacher-model-supp} of the supplementary, we report the robustness of adversarially trained teacher models used throughout the paper on the CIFAR-10 dataset in Table \ref{table:robust_teacher}. In Section~\ref{sec:architecture-supp} and Section~\ref{sec:architecture-search-fbnetv2}, we discuss the architectures of various proposed supernets used in RNAS-CL for the CIFAR-10 dataset and outline the neural architecture search process based on FBNetV2.



\section{Conclusions}
In this paper, we propose Robust Neural Architecture Search by Cross-Layer Knowledge Distillation (RNAS-CL), a novel NAS algorithm that enhances the robustness of the student model through cross-layer knowledge distillation from a robust teacher. RNAS-CL optimizes neural architectures in a differentiable manner, aiming to balance robustness with clean accuracy, and can be employed with or without robust training. Our experiments demonstrate that compact models trained using RNAS-CL surpass those trained without robust measures in terms of adversarial robustness. Furthermore, incorporating adversarial training into RNAS-CL significantly boosts its adversarial resilience. Upon undergoing robust training, RNAS-CL models exhibit comparable adversarial robustness to those trained robustly from the outset, yet achieve superior clean accuracy. As a direction for future research, we plan to integrate robust training during the architecture search phase to further improve the robustness of the models.

\section*{Acknowledgments}
This material is based upon work supported by the U.S. Department of Homeland Security under Grant Award Number 17STQAC00001-07-00.
The views and conclusions contained in this document are those of the authors and should not be interpreted as necessarily representing the official policies, either expressed or implied, of the U.S. Department of Homeland Security.
This work is also partially supported by the 2023 Mayo Clinic and Arizona State University Alliance for Health Care Collaborative Research Seed Grant Program.
% References
\bibliography{main}

\newpage

\onecolumn

\title{Neural Architecture Search Finds Robust Models by Knowledge Distillation\\(Supplementary Material)}
\maketitle


\appendix

\section{Robust teacher models}
\label{sec:robust-teacher-model-supp}
In this section, we report the robustness of adversarially trained teacher models used throughout the paper on the CIFAR-10 dataset in Table \ref{table:robust_teacher}.
\begin{table}[hbt!]
\centering
\caption{Robustness results for various teacher models on the CIFAR-10 dataset.}
\label{table:robust_teacher}
%\resizebox{0.3\linewidth}{!}{
    \begin{tabular}{|c|c|c|}
        \hline
        Method & Clean & PGD$^{20}$ \\ \hline
        WRT-34 & 86.07 & 58.33\\
        ResNet 18   & 84.59 & 55.54 \\
        ResNet 50   & 87.03 & 49.25\\ \hline
    \end{tabular}
%}
\end{table}


\begin{table*}[h]
    \centering
    \caption{The table describes the search space for CIFAR-10. Depth represents the depth of each stage. For example, 3-3-3 represents three convolution blocks in each stage. All search spaces have three stages. Stage 1, Stage 2, and Stage 3 represent the filter choices for the corresponding stages. For example, at stage 3 of RNAS-CL-S3, we search among 4 output channels, (64, 60, 56, 52), for each convolution block.}

    \label{table:search_space_cifar}
    \resizebox{0.65\textwidth}{!}{
        \begin{tabular}{|c|c|c|c|c|}
        \hline
        Search Space & Depth & Stage 1 & Stage 2 & Stage 3 \\
        \hline
        RNAS-CL-S3 & 3-3-3 & 16, 12 & 32, 28, 24, 20 & 64, 60, 56, 52 \\
        RNAS-CL-S5 & 5-5-5 & 16, 12 & 32, 28, 24, 20 & 64, 60, 56, 52 \\
        RNAS-CL-S7 & 7-7-7 & 16, 12 & 32, 28, 24, 20 & 64, 60, 56, 52 \\
        RNAS-CL-M & 9-7-1 & 80, 76 & 160, 156, 152, 148 & 128, 124, 120, 116 \\
        RNAS-CL-L & 9-7-1 & 160, 156 & 320, 316, 312, 308 & 256, 252, 248, 244 \\
        \hline
        \end{tabular}
    }
\end{table*}
%\vspace{-.1in}
\begin{table*}[h]
    \centering
    \caption{The table describes the search space for ImageNet and ImageNet-100. Similar to Table \ref{table:search_space_cifar}, depth represents the depth of each stage. For ImageNet, we have up to 5 stages. Stage 1, Stage 2, Stage 3, Stage 4, and Stage 5 represent the filter choices for their respective stages. For example, in stage 1, we search among 4 output channel options, (28, 24, 20, 16), for each convolution block.}

    \label{table:search_space_imagenet}
    \resizebox{0.65\textwidth}{!}{
        \begin{tabular}{|c|c|c|c|c|c|c|}
        \hline
       Search Space & Depth & Stage 1 & Stage 2 & Stage 3 & Stage 4 & Stage 5  \\
        \hline
        RNAS-CL-IS   & 3-3-3     & \begin{tabular}[c]{@{}l@{}}28, 24, \\ 20, 16\end{tabular} & \begin{tabular}[c]{@{}l@{}}40, 36, \\ 32, 28\end{tabular} &
        \begin{tabular}[c]{@{}l@{}}96, 88, 80, \\ 72, 64, 56, \\ 48\end{tabular}
        & & \\ \hline

        RNAS-CL-IM   & 3-3-3-4   & \begin{tabular}[c]{@{}l@{}}28, 24, \\ 20, 16\end{tabular} & \begin{tabular}[c]{@{}l@{}}40, 36, \\ 32, 28\end{tabular} &
        \begin{tabular}[c]{@{}l@{}}96, 88, 80, \\ 72, 64, 56, \\ 48\end{tabular} &
        \begin{tabular}[c]{@{}l@{}}128, 120, 108, \\ 100, 92, 84, \\ 76, 68\end{tabular} & \\ \hline

        RNAS-CL-I   & 3-3-3-4-4 & \begin{tabular}[c]{@{}l@{}}28, 24, \\ 20, 16\end{tabular} & \begin{tabular}[c]{@{}l@{}}40, 36, \\ 32, 28\end{tabular}&
        \begin{tabular}[c]{@{}l@{}}96, 88, 80, \\ 72, 64, 56, \\ 48\end{tabular}
        & \begin{tabular}[c]{@{}l@{}}128, 120, 108, \\ 100, 92, 84, \\ 76, 68\end{tabular} & \begin{tabular}[c]{@{}l@{}}216, 208, 200, \\ 192, 184, 176, \\ 168, 160, 152, \\ 144, 136, 128, \\ 120, 108\end{tabular}  \\ \hline

        RNAS-CL-IL   & 1-2-2-4-3 & \begin{tabular}[c]{@{}l@{}}28, 24, \\ 20, 16\end{tabular} & \begin{tabular}[c]{@{}l@{}}40, 36, \\ 32, 28\end{tabular}&
        \begin{tabular}[c]{@{}l@{}}96, 88, 80, \\ 72, 64, 56, \\ 48\end{tabular}
        & \begin{tabular}[c]{@{}l@{}}128, 120, 108, \\ 100, 92, 84, \\ 76, 68\end{tabular} & \begin{tabular}[c]{@{}l@{}}216, 208, 200, \\ 192, 184, 176, \\ 168, 160, 152, \\ 144, 136, 128, \\ 120, 108\end{tabular}  \\
        \hline
        \end{tabular}
    }
\end{table*}


\section{Architecture}
\label{sec:architecture-supp}
In this section, we discuss the architectures of the supernets used in RNAS-CL for the CIFAR-10 and ImageNet-100 datasets. Table \ref{table:search_space_cifar} describes the supernets used for CIFAR-10, all of which have three stages. The supernets used for ImageNet-100 are described in Table \ref{table:search_space_imagenet}; for ImageNet-100, the number of stages varies from 3 to 5.

%\vspace{-.1in}
\begin{figure}[!htbp]
\begin{center}
    \includegraphics[width=0.35\textwidth] {Images/fbnetv2.eps}
\end{center}
\caption{Illustration of the search for the neural architecture of each layer of the student model using the search mechanism in FBNetV2. $\left\{g_{w}^{(i)}\right\}$ represents the Gumbel weights associated with different filter choices.}
\label{fig:fbnet_v2}
\end{figure}
\section{Architecture Search by FBNetV2}
\label{sec:architecture-search-fbnetv2}
RNAS-CL builds a deep learning model that is both efficient and adversarially robust. In this work, we use the training paradigm of FBNetV2 to search for efficient models. In Figure \ref{fig:fbnet_v2}, we illustrate the search process for the neural architecture at a single convolution layer. Each filter choice is associated with a Gumbel weight, and these Gumbel weights are optimized to select an efficient model.


\section{More Experimental Results}
\label{sec:more-results-supp}






\subsection{Comparison against KD Variants}
\label{subsection:comparison_kd}

In this section, we evaluate our methods in comparison to a variety of knowledge distillation (KD) techniques as outlined in \citep{park2019relational, ahn2019variational, tung2019similarity, tian2019crd, passalis2018learning}. We utilize Robust WRT-34 as the teacher model across all KD methods and train three distinct student architectures: RNAS-CL-S3, RNAS-CL-S5, and RNAS-CL-S7. In Figure \ref{fig:comparison_kd_varients}, models trained under our paradigm are clearly positioned in the upper right-most part of the graph, underscoring the effectiveness of our intermediate cross-connections strategy. The RNAS-CL-S3 architecture, when trained using Relational Knowledge Distillation (RKD), demonstrates performance comparable to that achieved through our method. Beyond this, all models trained using the RNAS-CL approach significantly surpass other methods in both clean and adversarial accuracy, highlighting the robustness and efficiency of our training strategy.

   \begin{figure}[!ht]
        \begin{center}
            \includegraphics[scale=0.55, trim=0 0 0 0] {Images/comparison_against_KL_varients.jpeg}
        \end{center}

        \caption{The figure compares various knowledge distillation variants (Similarity \citep{tung2019similarity}, VID \citep{ahn2019variational}, RKD \citep{park2019relational}, CRD \citep{tian2019crd}, PKD \citep{passalis2018learning}) against RNAS-CL on the CIFAR-10 dataset. Adversarial Accuracy represents top-1 Accuracy on images perturbed by 20 step PGD attack. Clean Accuracy represents top-1 Accuracy on clean images. Larger marker size indicates larger architecture. For each method, RNAS-CL-S3, RNAS-CL-S5, and RNAS-CL-S7 are represented by increasing marker size.}
        \label{fig:comparison_kd_varients}
\end{figure}

\subsection{Comparison on CIFAR-10 against CW and AutoAttack}
\label{subsection:cw_aa}
In this section, we evaluate RNAS-CL and the approach of \citet{huang2021exploring} against recent adversarial attacks, specifically $\textup{CW}_\infty$ \citep{carlini2017towards} and AutoAttack \citep{croce2020reliable}, on the CIFAR-10 dataset. The CW attack, originally designed to overcome defensive distillation, is implemented here in its $\ell_\infty$ variant, optimized using PGD with a maximum perturbation budget of $\epsilon=8/255$. AutoAttack, a parameter-free ensemble attack, is currently regarded as one of the most reliable benchmarks for evaluating adversarial defenses. The results are reported in Table \ref{table:Compare_cw_autoattack}.


\begin{table}[!htbp]
% \setlength{\abovecaptionskip}{0.1mm}
\centering
\caption{Comparison between the performance of \citep{huang2021exploring}  and RNAS-CL against $\textup{CW}_\infty$ \citep{carlini2017towards} and AutoAttack \citep{croce2020reliable} on the CIFAR-10 dataset.}
\label{table:Compare_cw_autoattack}
%\resizebox{0.8\linewidth}{!}{
    \begin{tabular}{|c|c|c|}
    \hline
        Method & $\textup{CW}_{\infty}$ & AA \\ \hline
        VGG-R \citep{huang2021exploring}  & 46.49 &  38.44 \\
        DN-121-R \citep{huang2021exploring}  & 53.07 & 47.75 \\
        RNAS-CL-S3-WRT-34 (Ours) & 47.07  & 37.17\\
        RNAS-CL-S5-WRT-34 (Ours) & 48.33 & 39.28\\
        RNAS-CL-S7-WRT-34 (Ours) & 47.91 & 38.36\\
        RNAS-CL-M-WRT-34 (Ours)  & \textbf{53.52} & 46.89\\
        RNAS-CL-L-WRT-34 (Ours)  & 52.63 & \textbf{48.49}\\ \hline
    \end{tabular}
%}
\end{table}



\subsection{Results for ImageNet}
\label{subsection:ImageNet}
In this section, we compare our model against the SOTA compact and efficient method of \citet{huang2021exploring}, which is known to achieve one of the best PGD accuracies on ImageNet with a compact and efficient model. In Table \ref{table:Compare_NAS_ImageNet}, we evaluate RNAS-CL and the method of \citet{huang2021exploring} against a 10-step PGD attack with $\epsilon=4/255$ on the ImageNet dataset. Both models are adversarially trained using FastAT \citep{wong2020fast}. We additionally train RNAS-CL with FastAT and CAC to further increase the robustness.
RNAS-CL models outperform the method of \citet{huang2021exploring} in all three attributes: clean accuracy, robust accuracy, and the number of parameters.

\begin{table*}[!htbph]
\begin{center}
\caption{Performance of various efficient and robust methods on the ImageNet dataset. Clean and PGD are defined as in Figure~\ref{fig:comparison_plot}. $*$ denotes an approximate value.}
\label{table:Compare_NAS_ImageNet}
\resizebox{0.7\linewidth}{!}{
\begin{tabular}{|c|c|c|c|c|c|}
\hline
    Method & Objective & Clean & PGD$^{10}$ & Params (M) & GFLOPs \\
    \hline
    ResNet-50-R \citep{huang2021exploring} & FastAT & 56.63 & 31.14 & 25.5 & 4$^*$\\

    RNAS-CL-IL-WRT-50 & FastAT & 61.7 & 32.5 & 8.5 & 0.35\\

    RNAS-CL-IL-WRT-50 & FastAT + CAC & 61.5 & 33.5 & 8.5 & 0.35\\

\hline
\end{tabular}
}
\end{center}
\end{table*}
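
As a quick reading of Table~\ref{table:Compare_NAS_ImageNet}, the efficiency gap can be quantified directly from the reported numbers (the GFLOPs of ResNet-50-R are themselves approximate, so the second ratio is only indicative):
\begin{equation*}
\frac{25.5}{8.5} = 3, \qquad \frac{4}{0.35} \approx 11.4 .
\end{equation*}
That is, RNAS-CL-IL-WRT-50 uses about $3\times$ fewer parameters and roughly $11\times$ fewer GFLOPs than ResNet-50-R.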


\subsection{Ablation Study}
\label{subsection:ablation_study}

\begin{figure*}
\begin{center}
\includegraphics[width=0.95\linewidth] {Images/Attention_Map.eps}
\end{center}
\caption{(A) KL-I-R-50 represents attention maps from a model trained using cross-entropy loss and knowledge distillation without any cross-layer connections. Teacher and RNAS-CL represent attention maps from the robust teacher (ResNet-50) and RNAS-CL model. The name for each RNAS-CL layer includes its connected teacher layer. For example, in the $0$-th layer (13), 13 represents the corresponding teacher layer. RNAS-CL drives attention maps from student layers closer to their corresponding teacher layers. (B) Robustness evaluation under different perturbation sizes for PGD and FGSM attacks on CIFAR-10.}
        \label{fig:ablation_study}
\end{figure*}

\begin{table*}[hbt!]
% \setlength{\abovecaptionskip}{0.1mm}
\centering
\caption{Ablation study on the components used during RNAS-CL training on the CIFAR-10 dataset with RNAS-CL-S7-WRT-34 as the base model. CE represents models trained using the cross-entropy loss. CE + KL represents models trained by minimizing the cross-entropy loss and the standard KL divergence with a robust teacher model. CE + ICC represents models trained by minimizing the cross-entropy loss and Intermediate Cross-Connections (ICC). Clean and PGD are defined as in Figure~\ref{fig:comparison_plot}.}

\label{table:ablation_study}
\resizebox{0.6\linewidth}{!}{
    \begin{tabular}{|c|c|c|c|}
    \hline
        Training Type & Objective Function & Clean & PGD$^{20}$ \\ \hline

        \multirow{ 4}{*}{Without Adversarial training} & CE & 90.98 & 19.3 \\
         & CE + KL & 90.76 & 36.3 \\
         & CE + ICC & 90.33 & 35.54 \\
         & CE + KL + ICC & 90.62 & 37.24 \\ \hline

        \multirow{ 4}{*}{With Adversarial training} & CE & 80.85 & 39.67 \\
         & CE + KL & 85.07 & 41.63 \\
         & CE + ICC & 82.45 & 41.03 \\
         & CE + KL + ICC & 85.06 & 43.88 \\ \hline
    \end{tabular}
}
\end{table*}

\begin{figure*}
        \begin{center}
            \includegraphics[width=0.8\linewidth] {Images/ablation_study_rnas_cl}
        \end{center}

        \caption{Adversarial accuracy of various models at various perturbation budgets on CIFAR-10.}
        \label{fig:ablation_study_cifar_plot}
\end{figure*}


This ablation study demonstrates the significance of student-teacher cross-layer connections in RNAS-CL. We compare three types of training paradigms. In the first training paradigm, we conduct searching and training using cross-entropy loss without any teacher model. We refer to this as standard. In the second paradigm, we conduct searching and training by minimizing the cross-entropy loss and standard KL Divergence with a robust teacher model. We refer to the corresponding models as KL-X-T, where X represents the search space, and T represents the robust teacher model.
% In the third paradigm, we search and train using cross-entropy loss and intermediate cross-connections (ICC).We refer to the corresponding models as ICC-X-T.
Finally, the third model type is RNAS-CL, where we include all three terms, cross-entropy loss, KL Divergence, and cross-layer student-teacher connections.

In Figure \ref{fig:ablation_study}(A), we compare the attention maps of student models trained using RNAS-CL-I-R-50 with those of students trained using KL-I-R-50. We compare attention maps for various convolution layers at regular intervals. As expected, adding cross-layer connections brings the attention maps of the student model closer to those of the teacher model. Each student layer learns where to pay attention from its connected teacher layer. For example, in column (b), the KL-I-R-50 layer attends to various parts of the image, whereas the RNAS-CL layer learning from the $28$-th teacher layer pays more attention to the informative central part of the image. Similarly, in column (c), the RNAS-CL layer learns from the teacher model to pay more attention to the central and upper portions of the image. In Table \ref{table:ablation_study}, we compare the contributions of the individual components of RNAS-CL. We observe that under both training schemes, KL and ICC (Intermediate Cross-Connections) each significantly increase the robustness compared to the standard network. Combining KL and ICC, that is, RNAS-CL, achieves the highest PGD accuracy under both schemes. In Figure \ref{fig:ablation_study_cifar_plot}, we compare RNAS-CL models with KL-X-T and standard models under PGD attacks at various perturbation budgets on the CIFAR-10 dataset.





\end{document}
