% \documentclass{uai2024} % for initial submission
\documentclass[accepted]{uai2024} % after acceptance, for a revised version; 
% also before submission to see how the non-anonymous paper would look like 
                        
%% There is a class option to choose the math font
% \documentclass[mathfont=ptmx]{uai2024} % ptmx math instead of Computer
                                         % Modern (has noticeable issues)
% \documentclass[mathfont=newtx]{uai2024} % newtx fonts (improves upon
                                          % ptmx; less tested, no support)
% NOTE: Only keep *one* line above as appropriate, as it will be replaced
%       automatically for papers to be published. Do not make any other
%       change above this note for an accepted version.

%% Choose your variant of English; be consistent
\usepackage[american]{babel}
% \usepackage[british]{babel}

%% Some suggested packages, as needed:
\usepackage{natbib} % has a nice set of citation styles and commands
    \bibliographystyle{plainnat}
    \renewcommand{\bibsection}{\subsubsection*{References}}
\usepackage{mathtools} % amsmath with fixes and additions
% \usepackage{siunitx} % for proper typesetting of numbers and units
\usepackage{booktabs} % commands to create good-looking tables
\usepackage{tikz} % nice language for creating drawings and diagrams

% my package
\usepackage{graphicx}
\usepackage{xspace}
\usepackage[ruled]{algorithm2e}
\usepackage{colortbl}
\definecolor{lightgray}{gray}{0.9}
\usepackage{hyperref}
\usepackage{cleveref} % must be loaded after hyperref
\usepackage{amsmath,amssymb,amsfonts}
\usepackage{bm}


%% Provided macros
% \smaller: Because the class footnote size is essentially LaTeX's \small,
%           redefining \footnotesize, we provide the original \footnotesize
%           using this macro.
%           (Use only sparingly, e.g., in drawings, as it is quite small.)

%% Self-defined macros
\newcommand{\swap}[3][-]{#3#1#2} % just an example

\def\eg{\textit{e.g.}}
\def\ie{\textit{i.e.}}
\def\etal{\textit{et al.}}
\def\vx{{\bm{x}}}
\DeclareMathOperator*{\argmin}{arg\,min}

% \title{EntProp: High Entropy Propagation via Auxiliary Batch Normalization Layers}
\title{EntProp: High Entropy Propagation for Improving Accuracy and Robustness}

% The standard author block has changed for UAI 2024 to provide
% more space for long author lists and allow for complex affiliations
%
% All author information is automatically removed by the class for the
% anonymous submission version of your paper, so you can already add your
% information below.
%
% Add authors
\author[1]{\href{mailto:<alesana882@gmail.com>?Subject=Your UAI 2024 paper}{Shohei Enomoto}}

% Add affiliations after the authors
\affil[1]{%
    NTT\\
    Tokyo, Japan
}
  
\begin{document}
\maketitle

\begin{abstract}
Despite their impressive performance, deep neural networks (DNNs) struggle to generalize to out-of-distribution domains that differ from those seen in training.
In practical applications, it is important for DNNs to have both high standard accuracy and robustness against out-of-distribution domains.
One technique that achieves both of these improvements is disentangled learning with mixture distribution via auxiliary batch normalization layers (ABNs).
This technique treats clean and transformed samples as different domains, allowing a DNN to learn better features from mixed domains.
However, if we distinguish the domains of the samples based on entropy, we find that some transformed samples are drawn from the same domain as clean samples, and these samples are not completely different domains.
To generate samples drawn from a completely different domain than clean samples, we hypothesize that transforming clean high-entropy samples to further increase the entropy generates out-of-distribution samples that are much further away from the in-distribution domain.
On the basis of the hypothesis, we propose high entropy propagation~(EntProp), which feeds high-entropy samples to the network that uses ABNs.
We introduce two techniques, data augmentation and free adversarial training, that increase entropy and bring the sample further away from the in-distribution domain.
These techniques do not require additional training costs.
Our experimental results show that EntProp achieves higher standard accuracy and robustness with a lower training cost than the baseline methods.
In particular, EntProp is highly effective at training on small datasets.
\end{abstract}

\section{Introduction}
% background
Deep neural networks (DNNs) have achieved impressive performance in a variety of fields, such as computer vision, natural language processing, and speech recognition.
However, DNNs are susceptible to accuracy degradation when presented with data distributions that deviate from the training distribution.
This is a common occurrence in outdoor environments, such as autonomous driving and surveillance cameras, due to variations in weather and brightness~\citep{diamond2021dirty,in-c,Zendel_2018_ECCV}.
As a result, while standard accuracy is essential for DNNs, robustness against distribution shifts is equally important.

% existing method
Various techniques have been proposed to improve robustness against out-of-distribution domains~(\eg, domain adaptation~\citep{saenko2010adapting,ganin2015unsupervised,tzeng2015simultaneous}), but many of them decrease the standard accuracy.
One technique to improve both standard accuracy and robustness is disentangled learning with mixture distribution using a dual batch normalization (BN) layer~\citep{advprop,fast_advprop,mixprop,augmax}.
This technique prepares auxiliary BN layers (ABNs) in addition to the main BN layers (MBNs).
It feeds the clean samples and the samples transformed by adversarial attacks or data augmentation to the same network but with different BNs, \ie, the MBNs are used for the clean samples and the ABNs for the transformed samples.
% The MBN-applied network trains clean samples, and the ABN-applied network trains samples transformed by adversarial attacks or data augmentation.
The distinction of the BNs used to train samples of different domains prevents mixing of the BN layer statistics and the affine parameters~\citep{zhang2023a}, allowing the MBN-applied network to learn better from the features of both the out-of-distribution and in-distribution domains~\citep{advprop}.
Furthermore, since only the MBNs are used during inference, there is no increase in computational cost at test time.

% RQ
Existing studies treat clean and transformed samples as different domains; however, it is not clear whether these samples are entirely different domains.
Clean samples clearly belong to the in-distribution domain.
The transformed samples can be divided into two groups: those that are highly transformed and those that are less transformed.
Therefore, we pose the following research question: \textit{Do transformed samples include samples drawn from both the in-distribution and out-of-distribution domains?}

% proposed method
As a first step in answering this question, we consider distinguishing between the in-distribution and out-of-distribution samples.
Since adversarial attacks and data augmentation are transformations that increase the diversity and hardness of samples~\citep{augmax}, we verify the distinction of domains using an uncertainty metric, entropy.
\Cref{fig:mixadvprop_ent_epochs} shows the entropy of clean and transformed samples when training the network with baseline methods.
The results show that the entropy of some high-entropy clean samples overlaps with that of the transformed samples.
Since clean high-entropy samples are already similar to out-of-distribution samples, we hypothesize that applying entropy-increasing transformations to clean high-entropy samples generates out-of-distribution samples that are much further away from the in-distribution samples.
From this hypothesis, we propose high entropy propagation (EntProp), which trains the ABN-applied network with high-entropy samples.
First, the network is trained on clean samples using the MBNs and computes their entropy.
Then, the network is trained on the high-entropy clean samples using the ABNs.
At this time, to further increase the entropy of the samples and bring them further away from the in-distribution domain, we introduce two techniques, data augmentation and free adversarial training~\citep{free_adv}.
These techniques have no additional training cost and allow for further accuracy gains.
% Furthermore, we add a PGD attack~\citep{pgd} to EntProp as well as AdvProp~\citep{advprop}, which increases the computational cost but also increases accuracy.

\begin{figure}[tb]
\centering
\includegraphics[width=1.0\linewidth]{figure/Mix_AdvProp_entropy_epochs.pdf}
\caption{
Entropy per epoch when ResNet-18 is trained with MixProp~\citep{mixprop} (left) and AdvProp~\citep{advprop} (right) on the CIFAR-100 dataset.
Error bars indicate one standard deviation, and lines indicate the mean.
}
\label{fig:mixadvprop_ent_epochs}
\end{figure}

% experiment
We evaluated EntProp on five widely used image classification datasets with several DNN architectures.
We show that simply training the ABN-applied network on clean high-entropy samples improves both standard accuracy and robustness, even though it does not use adversarial attacks or data augmentation.
EntProp, which includes two entropy-increasing techniques, shows higher accuracy at a lower training cost than baseline methods.
Furthermore, we show that on small datasets, the use of adversarial training on all samples leads to overfitting, which can be resolved by effective undersampling, such as EntProp.

% contributions
The contributions of this paper are as follows:
\begin{itemize}
\item We propose a novel disentangled learning method via ABNs that distinguishes sample domains based on entropy.
We show that training the ABN-applied network on high-entropy samples improves both standard accuracy and robustness.
\item We introduce two techniques, data augmentation and free adversarial training, which further increase sample entropy and model accuracy without additional training cost.
\item Our extensive experiments show that EntProp achieves better standard accuracy and robustness than baseline methods, despite its lower training cost.
We show that on small datasets, using all samples for adversarial training leads to overfitting, while undersampling methods such as EntProp prevent overfitting, benefit from adversarial training, and improve accuracy.
% \item We propose EntProp, a novel disentangled learning method via ABNs.
% We treat high-entropy samples as out-of-distribution domain, and introduce two techniques to further separate them from the in-distribution domain.
% \item Our experiments show that EntProp achieves better standard and robustness than baseline methods, despite its lower training cost.
% \item We demonstrate that using all samples for adversarial training on small datasets leads to overfitting and lower accuracy than vanilla training. 
% Undersampling methods such as EntProp prevent overfitting, benefit from adversarial training, and improve accuracy.
\end{itemize}

\section{Related Work}
Adversarial attacks~\citep{fgsm,pgd} cause DNNs to make wrong predictions by adding human-imperceptible perturbations to input samples.
To defend against such attacks, a variety of methods~\citep{clp,trades,mart} have been proposed to train DNNs with adversarial samples, also known as adversarial training.
However, adversarial training has a trade-off~\citep{tsipras2018robustness,ilyas2019adversarial} between accuracy on clean samples and robustness to adversarial attacks, compromising accuracy on clean samples to achieve high robustness.
The reason for this trade-off was thought to be that the two domains are learned simultaneously by a single DNN, motivated by the two-domain hypothesis~\citep{xie2019intriguing} that clean and adversarial samples are drawn from different domains.
Based on this hypothesis, \citet{xie2019intriguing} showed that using MBNs for clean samples and ABNs for adversarial samples avoids mixing the statistics and affine parameters of BN layers~\citep{zhang2023a} by two different domains and achieves high accuracy for the domain for which each BN layer is trained.
AdvProp~\citep{advprop} showed that disentangled learning for a mixture of distributions via ABNs allows DNNs with MBNs to learn more effectively from both adversarial and clean samples, improving the standard accuracy and the accuracy for the out-of-distribution domain.
AdvProp is simple and highly practical, and has since been developed in various ways.
Fast AdvProp~\citep{fast_advprop} reduced the number of samples and iterations required for adversarial attacks, resulting in the same computational cost as vanilla training, with higher accuracy.
Disentangled learning via ABNs has proven effective not only with adversarially attacked samples, but also with data-augmented samples~\citep{merchant2020does,mixprop,augmax} and style-transferred samples~\citep{shape_texture}.
Furthermore, AdvProp was proposed for various applications, including object detection tasks~\citep{det_advprop}, contrastive learning~\citep{jiang2020robust,ho2020contrastive}, and training vision transformers~\citep{pyramid}.

Although these studies treat clean and transformed samples as different domains, we argue that some of these samples overlap in domain.
We train the MBN-applied network with in-distribution domain samples and the ABN-applied network with high-entropy samples as the out-of-distribution domain.

% BN
% Adv
% Uncertainty

\section{Proposed Method}
In this section, we describe our method, high entropy propagation (EntProp), for effective disentangled learning with mixture distribution via ABNs.
\subsection{Motivation}
Baseline methods treat clean samples as the in-distribution domain and samples transformed by adversarial attacks~\citep{advprop,fast_advprop,xie2019intriguing} or data augmentation~\citep{mixprop,merchant2020does} as the out-of-distribution domain, and distinguish the BNs used for these samples.
Although it is clear that clean samples belong to the in-distribution domain, we question whether all transformed samples belong to the out-of-distribution domain.
In the transformed samples, some samples are significantly affected by the transformation and are further away from the in-distribution domain, while some samples are less affected and closer to the in-distribution domain.
Because the MixUp and PGD attack used by MixProp and AdvProp are both sample transformations that increase entropy, we use entropy as the initial investigation to distinguish the distributions.
If the distributions are distinguished by entropy as shown in \Cref{fig:mixadvprop_ent_epochs}, some of the clean and transformed samples overlap in domain, which may prevent effective disentangled learning via ABNs.
% hypothesis
Since clean high-entropy samples are in the same domain as the transformed out-of-distribution samples, we hypothesize that transforming these samples to increase entropy generates out-of-distribution samples that are significantly different from the in-distribution samples.
On the basis of the hypothesis, we propose EntProp, which trains the ABN-applied network on high-entropy samples.
% Therefore, based on the fact that adversarial attack and data augmentation are transformations that increase entropy, we focus on entropy as a measure for the distance of a sample from the in-distribution domain.
% We hypothesize that the success of baseline methods is due to the feeding of high-entropy samples to the ABNs, and propose EntProp, which feeds only high-entropy samples in the mini-batch to the ABNs.
% method
% We show the algorithm.

\subsection{Methodology}
Here, we describe the process of one iteration of EntProp training.
We assume a network with ABNs in addition to the MBNs.
\Cref{fig:overview} shows an overview of EntProp and the baseline methods, and \Cref{algo:entprop} shows the pseudocode of EntProp.
EntProp consists of three components, \textit{Sample Selection}, \textit{Data Augmentation}, and \textit{Free Adversarial Training}, which we detail below.

\paragraph{Sample Selection.}
First, the MBN-applied network outputs a prediction $\hat{p}(y|\vx)$ for the class label $y$ from a clean sample $\vx$.
From the prediction, we compute the loss and the entropy:
\begin{align}
H = \sum_{y=1}^C -\hat{p}(y|\vx) \log \hat{p}(y|\vx),
\label{eq:entropy}
\end{align}
where $H$ is the entropy and $C$ is the number of classes.
Next, we feed the $k|\mathcal{B}|$ samples with the highest entropy to the ABN-applied network to compute the loss, where $k \in [0, 1]$ is a hyperparameter and $|\mathcal{B}|$ is the batch size.
Finally, we update the network parameters using the gradient to minimize the total loss.
% + low-cost high entropy 
Furthermore, based on our hypothesis, we introduce two techniques that increase the entropy of samples without additional training cost: data augmentation and free adversarial training.
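
As a small worked illustration of \Cref{eq:entropy} (the predictions below are hypothetical, and we assume the natural logarithm), consider $C=2$ with a confident prediction $\hat{p}=(0.9, 0.1)$ and an uncertain prediction $\hat{p}=(0.5, 0.5)$:
\begin{align*}
H_{\mathrm{conf}} &= -0.9\log 0.9 - 0.1\log 0.1 \approx 0.325, \\
H_{\mathrm{unc}} &= -2 \cdot 0.5\log 0.5 = \log 2 \approx 0.693.
\end{align*}
The uncertain sample has the higher entropy and is therefore ranked first when selecting samples for the ABN-applied network.
With an illustrative mini-batch of $|\mathcal{B}|=10$ and $k=0.2$, the $k|\mathcal{B}|=2$ samples with the highest entropy would be selected.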

\paragraph{Data Augmentation.}
Data augmentation is a widely used technique for training DNNs that improves accuracy by transforming samples to increase their diversity and hardness.
Since most data augmentations use simple transformations, the computational cost is negligible compared to training DNNs.
We use the popular data augmentation technique, MixUp~\citep{mixup}, to increase the entropy of the samples.
MixUp linearly combines two samples in a mini-batch and increases entropy because the combined sample has two labels.
Unlike MixProp~\citep{mixprop}, we treat augmented samples as belonging to the in-distribution domain and train the MBN-applied network on the augmented samples to calculate the loss and entropy.
Since MixUp improves standard accuracy, samples transformed by MixUp retain sufficient information about the in-distribution domain.
Furthermore, MixUp eliminates the high-entropy sample selection bias in each iteration, allowing the ABN-applied network to be trained on diverse samples (see \Cref{sec:bias} for details).
The MixUp loss function is defined as:
\begin{align}
\label{eq:mixup_loss}
L^{m} = \lambda L^{c}(\theta, \vx^{m}, y^{a}) + (1 - \lambda) L^{c}(\theta, \vx^{m}, y^{b}),
\end{align}
where $L^{c}$ is the cross-entropy loss, $\theta$ denotes the network parameters, $\lambda$ is the mixing coefficient, $\vx^{m}$ denotes the mixed samples, and $y^{a}$ and $y^{b}$ are the labels of the samples before mixing.
If EntProp does not use MixUp, the MBN-applied network is trained with $L^{c}$ on clean samples.
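
To make \Cref{eq:mixup_loss} concrete, the following sketch uses the standard MixUp construction; the sample indices $i, j$, the symbol $\beta$ for the Beta-distribution parameter, and all numerical values are illustrative assumptions rather than settings taken from our experiments.
MixUp draws $\lambda \sim \mathrm{Beta}(\beta, \beta)$ and forms
\begin{align*}
\vx^{m} = \lambda \vx_{i} + (1-\lambda)\vx_{j},
\end{align*}
with $y^{a}=y_{i}$ and $y^{b}=y_{j}$.
For instance, if $\lambda = 0.7$, $L^{c}(\theta, \vx^{m}, y^{a}) = 0.4$, and $L^{c}(\theta, \vx^{m}, y^{b}) = 1.5$, then
\begin{align*}
L^{m} = 0.7 \cdot 0.4 + 0.3 \cdot 1.5 = 0.73.
\end{align*}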

\paragraph{Free Adversarial Training.}
\citet{free_adv} generate adversarial examples by reusing the gradients computed for training in the previous iteration.
We use this technique to generate adversarial examples $\vx^{a}$ for high-entropy samples.
EntProp first calculates the loss on clean or augmented samples with the MBN-applied network, which allows free adversarial examples to be generated from the gradient obtained at this step.
Note that it is not optimal to use the MBN-applied network gradient to generate an adversarial attack on the ABN-applied network.
When we use augmented samples, we use the gradient obtained from the augmentation loss to generate an adversarial example.
% We generate adversarial examples $\vx^{a}$ only from high-entropy samples.
In the case of multiple iterations for the attacker, as in a Projected Gradient Descent~(PGD)~\citep{pgd} attack, the first one has no computational cost, but the subsequent ones have the same computational cost as a standard adversarial attack and are generated from the gradient of the ABN-applied network.
For the PGD attack, we set perturbation size $\epsilon$ to $n+1$ and attack step size $\alpha$ to $1$, where $n$ is the number of iterations for the attacker.
If the number of iterations is $1$, then $\epsilon$ is set to $1$.

% % overall 
% The overall loss function is defined as:
% \begin{align}
% \label{eq:total_loss}
% Loss = L^{m} + L^{c}(\theta, \vx^{a}, y),
% \end{align}
% where $\vx^{a}$ is the attacked high-entropy sample.


% cost
\subsection{Training Cost}
Here, we consider the training cost of one epoch.
We denote the cost of a single forward and backward pass for a single sample as $1$ and the size of the dataset as $N$. 
The cost of vanilla training for one epoch is $N$. 
EntProp first uses the clean mini-batch and then $k|\mathcal{B}|$ samples of the mini-batch, so the cost is $(1+k)N$.
The computational cost of data augmentation and free adversarial training~($n=1$) is negligible compared to the computational cost of forward and backward passes, thus using them does not change the overall training cost.
If we increase the number of attack iterations $n$ beyond $1$, it costs an additional $k(n-1)N$.
Consequently, the training cost of EntProp is $(1+k)N + k(n-1)N = (1+kn)N$.
\Cref{tab:train_cost} shows the training cost per epoch for baseline methods and EntProp.
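
As a check of the expression above, the per-epoch costs of the EntProp configurations evaluated in our experiments follow directly from $(1+kn)N$:
\begin{align*}
k=0.2,\ n=1 &: (1 + 0.2 \cdot 1)N = 1.2N, \\
k=0.6,\ n=1 &: (1 + 0.6 \cdot 1)N = 1.6N, \\
k=0.2,\ n=5 &: (1 + 0.2 \cdot 5)N = 2N, \\
k=0.6,\ n=5 &: (1 + 0.6 \cdot 5)N = 4N.
\end{align*}
For comparison, AdvProp with $n=5$ costs $(2+5)N = 7N$, and Fast AdvProp with $p_{\mathrm{adv}}=0.2$ costs $(1+0.2)N = 1.2N$.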

\begin{table}[tb]
\centering
\caption{
Training costs for each method.
$p_{\mathrm{adv}}$ is the Fast AdvProp hyperparameter that determines the sample percentage used for adversarial attack.
}
\label{tab:train_cost}
\resizebox{\columnwidth}{!}{
\begin{tabular}{l|ccccc}
\toprule
              & Vanilla & AdvProp & Fast AdvProp & MixProp & EntProp \\ \midrule
Training Cost & $\mathrm{N}$       & $(2+n)\mathrm{N}$     & $(1+p_{\mathrm{adv}})\mathrm{N}$         & $2\mathrm{N}$      &  $(1+kn)\mathrm{N}$              \\ \bottomrule
\end{tabular}
}
\end{table}
% \vspace{-0.9em}

\begin{algorithm}[t!]
% \small
\DontPrintSemicolon
\KwData{A set of clean samples with labels;}
\KwResult{Network parameter $\theta$;}
\For{each training step}
{
Sample a clean mini-batch $\vx$ with label $y$;\\
Generate the corresponding augmented mini-batch $\vx^{m}$ and labels $y^{a}$ and $y^{b}$; \\
Compute loss $L^{m}$ and entropy $H$ on augmented mini-batch using the MBNs from \Cref{eq:entropy,eq:mixup_loss}; \\
Obtain the gradient $\nabla \leftarrow \nabla_{\vx^{m}}$; \\
Get the top $k|\mathcal{B}|$ samples $\vx^{a}$ with the highest entropy from the augmented mini-batch; \\
$\delta \leftarrow 0$; \\
\For{$i=1, \dots, n$}{
$\delta \leftarrow \delta + \epsilon \cdot \mathrm{sign}(\nabla)$; \\
$\vx^{a} \leftarrow \vx^{a} + \mathrm{clip}(\delta, -\epsilon, \epsilon)$; \\
Compute loss $L^c(\theta, \vx^{a}, y)$ on adversarial sample using the ABNs;\\
Obtain the gradient $\nabla \leftarrow \nabla_{\vx^{a}}$; \\
}
% Generate the corresponding adversarial sample $\vx^{a}$ using the gradient $\nabla_{\vx^{m}}$; \\
% Compute loss $L^c(\theta, \vx^{a}, y)$ on adversarial sample using the ABNs;\\
Minimize the total loss w.r.t. network parameter $\argmin\limits_{\theta} L^{m} + L^{c}(\theta, \vx^{a}, y)$.
}
\KwRet{$\theta$}
\caption{Pseudocode of EntProp}
\label{algo:entprop}
\end{algorithm}
\vspace{-0.9em}


\begin{figure}[tb]
\centering
\includegraphics[width=1.0\linewidth]{figure/overview.pdf}
\caption{
Overview of baseline methods (left) and EntProp (right).
The baseline methods feed clean samples to the MBNs and transformed samples to the ABNs.
EntProp treats the augmented samples as belonging to the in-distribution domain and feeds them to the MBNs.
EntProp then applies an adversarial attack to the high-entropy samples and feeds them to the ABNs.
% EntProp feeds clean samples to MBN and transforms clean high-entropy samples and feeds them to ABN.
}
\label{fig:overview}
\end{figure}

\section{Experiments}
In this section, we conduct experiments on five widely used image classification datasets and show the effectiveness of EntProp.

\subsection{Experimental Setup}

\subsubsection{Datasets}
To extensively evaluate the effectiveness of EntProp and the baseline methods, we measure performance on the following five datasets, which are widely used as image classification benchmarks: CIFAR-100~(C100)~\citep{cifar}, CUB-200-2011~(CUB)~\citep{cub}, OxfordPets~(Pets)~\citep{pets}, StanfordCars~(Cars)~\citep{cars}, and ImageNet~(IN)~\citep{imagenet}.
We provide details on each dataset in the Appendix.

\subsubsection{Comparison Methods}
We compared EntProp with the following four baseline methods.
\begin{itemize}
\item \textbf{Vanilla.}
Vanilla training of a network without ABNs.
\item \textbf{AdvProp.}
AdvProp feeds the clean samples and the adversarial samples to the same network but with different BNs.
We used PGD as the attacker to generate adversarial samples.
We set the perturbation size $\epsilon$ to $4$. %1 for CIFAR-100 and 4 otherwise. 
The number of iterations for the attacker is $n=5$ and the attack step size is $\alpha=1$.
\item \textbf{Fast AdvProp.}
Fast AdvProp speeds up AdvProp by reducing the number of iterations of the PGD attacker and the percentage of training samples used as adversarial examples.
We set the percentage of training samples used as adversarial examples to $p_{\mathrm{adv}}=0.2$, the perturbation size $\epsilon$ to $1$, the number of iterations for the attacker to $n=1$, and the attack step size to $\alpha=1$.
\item \textbf{MixProp.}
MixProp feeds the clean samples and the samples augmented with MixUp to the same network but with different BNs.
The parameter of the beta distribution used for MixUp is set to 0.2 for ImageNet and 1 otherwise.

% \paragraph{EntProp.}
% To keep the training cost the same as for Fast AdvProp, we set the EntProp hyperparameter, $k=0.2$.
% When we used PGD attack, we set perturbation size $\epsilon=n+1$ and attack step size $\alpha=1$.
% Note that when $n=1$, $\epsilon$ is set to 1.
% k=0.4, mixup alpha, PGD, 
% eps=n+1, except n=1
\end{itemize}

\subsubsection{Implementation Details}
For the C100 experiments, we trained DNNs for 200 epochs with a batch size of 128, using SGD with momentum of 0.9 and weight decay of 0.0005.
The learning rate started at 0.1 and was decayed with a cosine scheduler.
For the CUB, Pets, and Cars experiments, we fine-tuned the DNNs, which were pre-trained~\citep{torchvision} on the ImageNet dataset, using Adam~\citep{adam} optimizer. 
Since the pretrained DNNs do not have ABNs, we set the initial weights of the ABNs to be the same as those of the MBNs.
We fine-tuned the networks with a batch size of 64 for 100 epochs with a weight decay of 0.0005.
The learning rate started at 0.0001 and was decreased by a factor of 0.1 every 10 epochs.
For the IN experiments, we trained DNNs for 105 epochs with a batch size of 256, using SGD with momentum of 0.9 and weight decay of 0.0005.
The learning rate started at 0.1 and was decreased by a factor of 0.1 every 30 epochs.

In accordance with the Fast AdvProp setting, we adjust the baseline methods to align the loss scale with that of vanilla training.
Furthermore, because Fast AdvProp and EntProp duplicate training samples within a single iteration, we ensure that all samples are treated equally through weight normalization.

\subsubsection{Evaluation Metrics}
We evaluate standard accuracy (SA), the accuracy on the standard test set, and robust accuracy (RA), the average accuracy on an artificially corrupted test set~\citep{in-c}.
The artificial corruptions are the same as those used in the ImageNet-C dataset: the corrupted test set consists of 15 types of corruption at five severity levels, and we report the average accuracy over all of them as RA.
Furthermore, to evaluate the balance between SA and RA, we define the harmonic mean as our evaluation metric.
% \footnote{Gaussian noise, Shot noise, Impulse noise, Defocus blur, Glass blur, Motion blur, Zoom blur, Snow, Frost, Fog, Brightness, Contrast, Elastic transform, Pixelate, JPEG.}
\begin{align}
\label{eq:h_score}
\mathrm{H_{score}} = \frac{2 \mathrm{SA} \cdot \mathrm{RA}}{\mathrm{SA} + \mathrm{RA}}.
\end{align}

$\mathrm{H_{score}}$ is high only when both SA and RA are high.
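As a worked example using the vanilla ResNet-50 result on C100 from \Cref{tab:comparison} ($\mathrm{SA}=79.30$, $\mathrm{RA}=51.01$):
\begin{align*}
\mathrm{H_{score}} = \frac{2 \cdot 79.30 \cdot 51.01}{79.30 + 51.01} \approx 62.08,
\end{align*}
which is below the arithmetic mean of approximately $65.2$, because the harmonic mean is pulled toward the lower of the two accuracies.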

\subsection{Main Experiments}
In this section, we show the effectiveness of EntProp.
% For a fair comparison, all methods used MixUp during training.
% All methods except MixProp trained the MBN-applied network on the augmented samples.
% Fast AdvProp and AdvProp generated adversarial examples from clean samples following the Fast AdvProp setting.
We describe more detailed experimental results in the Appendix.
All experiments, except the ImageNet experiment, were performed three times, and we report the average values.
The best and second-best results are \textbf{bolded} and \underline{underlined}, respectively.

\subsubsection{Comparison Results}
\Cref{tab:comparison} shows the comparison results between EntProp and the baseline methods.
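EntProp~($k=0.2, n=1$) and EntProp~($k=0.2, n=5$) have the same training cost as Fast AdvProp and MixProp, respectively, yet achieve a higher $\mathrm{H_{score}}$ on every dataset except Pets, where EntProp~($k=0.2, n=5$) is on par with MixProp (72.80 vs.\ 72.82).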
EntProp~($k=0.6, n=5$) has a much lower training cost than AdvProp, but shows the highest $\mathrm{H_{score}}$ for all datasets except Cars.
These results indicate that EntProp allows for more efficient training by bringing the samples fed to the ABN-applied network further away from the in-distribution domain.
For small datasets such as CUB and Pets, EntProp shows particularly large improvements, whereas AdvProp shows smaller ones.
Because adversarial training requires large datasets~\citep{schmidt2018adversarially}, AdvProp leads to overfitting~\citep{rice2020overfitting} on small datasets, making it difficult to improve performance.
EntProp mitigates this overfitting through entropy-based undersampling, and thus benefits from adversarial training and improves accuracy.

Moreover, in \Cref{fig:acc_cost} we demonstrate the trade-off between average $\mathrm{H_{score}}$ and training cost across all datasets excluding ImageNet.
EntProp shows a higher $\mathrm{H_{score}}$ with a smaller increase in training cost and a better trade-off than the baseline methods.

\begin{figure}[tb]
\centering
\includegraphics[width=1.0\linewidth]{figure/EntProp_acc_cost.pdf}
\caption{
Average $\mathrm{H_{score}}$ and training cost over all datasets except ImageNet.
We plot the relative values with the vanilla training cost as 1.
}
\label{fig:acc_cost}
\end{figure}


\begin{table*}[tb]
\centering
\caption{
Accuracy and training cost of training ResNet-50 with each method on five datasets.
% The best and second results are \textbf{bolded} and \underline{underlined}.
% Avg. indicates the average of the four networks.
$\dagger$ indicates that it is a number from the original literature.
}
\label{tab:comparison}
\resizebox{\textwidth}{!}{
\begin{tabular}{l|c|ccccccccccccccc}
\toprule
 & Dataset & \multicolumn{3}{c}{C100} & \multicolumn{3}{c}{CUB} & \multicolumn{3}{c}{Pets} & \multicolumn{3}{c}{Cars} & \multicolumn{3}{c}{IN} \\ 
\cmidrule(r){2-2}
\cmidrule(r){3-5}
\cmidrule(r){6-8}
\cmidrule(r){9-11}
\cmidrule(r){12-14}
\cmidrule(r){15-17}
Method & Cost &  SA(\%) & RA(\%) & $\mathrm{H_{score}}$ &  SA(\%) & RA(\%) & $\mathrm{H_{score}}$ & SA(\%) & RA(\%) & $\mathrm{H_{score}}$ & SA(\%) & RA(\%) & $\mathrm{H_{score}}$ &  SA(\%) & RA(\%) & $\mathrm{H_{score}}$ \\ 
\midrule
Vanilla & N & 79.30 & 51.01 & 62.08 & 81.99 & 48.37 & 60.85 & 92.24 & 49.97 & 64.82 & 90.18 & 41.07 & 56.43 & 76.13 & 39.57 & 52.07 \\
MixProp & 2N & \textbf{81.84} & 55.55 & 66.18 & \textbf{83.77} & 56.80 & 67.70 & \textbf{93.01} & 59.84 & 72.82 & \textbf{91.30} & 51.13 & 65.55 & \textbf{77.20} & 41.79 & 54.23 \\
Fast AdvProp & 1.2N & 79.43 & 53.31 & 64.45 & 82.90 & 51.22 & 63.32 & \underline{92.78} & 54.42 & 68.60 & \underline{90.71} & 44.31 & 59.54 & 76.60 & 40.71 & 53.16 \\
AdvProp & 7N & 78.05 & \underline{58.94} & \underline{67.17} & 81.45 & 53.44 & 64.54 & 92.10 & 55.82 & 69.51 & 90.45 & \underline{54.11} & \underline{67.71} & \underline{$77.10^\dagger$} & N/A & N/A \\
\rowcolor{lightgray} EntProp~($k=0.2, n=1$) & 1.2N & 79.99 & 56.07 & 65.93 & 82.92 & 60.18 & 69.74 & 92.23 & 59.28 & 72.18 & 90.48 & 52.93 & 66.79 & 76.29 & 42.70 & 54.75 \\
\rowcolor{lightgray} EntProp~($k=0.6, n=1$) & 1.6N & 80.31 & 57.41 & 66.12 & \underline{83.32} & \underline{62.02} & \underline{71.11} & 92.47 & \underline{62.35} & \underline{74.48} & 90.15 & \textbf{55.66} & \textbf{68.82} & 75.09 & 41.49 & 53.45 \\
\rowcolor{lightgray} EntProp~($k=0.2, n=5$) & 2N & 78.21 & 57.27 & 66.95 & 83.10 & 60.71 & 70.17 & 92.21 & 60.14 & 72.80 & 90.36 & 52.00 & 66.01 & 76.35 & \underline{43.13} & \underline{55.12} \\
\rowcolor{lightgray} EntProp~($k=0.6, n=5$) & 4N & \underline{80.62} & \textbf{60.50} & \textbf{69.12} & 82.65 & \textbf{64.00} & \textbf{72.14} & 92.15 & \textbf{66.75} & \textbf{77.42} & 90.21 & 52.72 & 66.55 & 76.47 & \textbf{44.45} & \textbf{56.22} \\ 
\bottomrule
\end{tabular}
}
\end{table*} 

\subsubsection{Other Distribution Shift Datasets}
Here, we evaluate EntProp on distribution-shifted datasets other than the corrupted dataset.
\Cref{tab:in_other} shows the accuracy of EntProp and baseline methods on the ImageNet variant datasets.
Disentangled learning methods using ABN improve accuracy over vanilla training under various types of distribution shifts, and EntProp~($k=0.6, n=5$) achieves the highest accuracy on all three datasets.


\begin{table}[tb]
\centering
\caption{
Accuracy on distribution-shifted datasets other than the corrupted dataset when ResNet-50 is trained with each method.
A, R, and Stylized denote ImageNet-A~\citep{in-a}, ImageNet-R~\citep{in-r}, and Stylized-ImageNet~\citep{sty-in}, respectively.
}
\label{tab:in_other}
\begin{tabular}{l|ccc}
\toprule
Method & A & R & Stylized \\
\midrule
Vanilla & 0.00 & 36.17 & 7.18 \\
MixProp & \underline{3.17} & 38.75 & 8.32 \\
Fast AdvProp & 2.19 & 38.17 & 8.17 \\
\rowcolor{lightgray} EntProp~($k=0.2, n=1$) & 2.87 & \underline{39.88} & 9.56 \\
\rowcolor{lightgray} EntProp~($k=0.6, n=1$) & 2.60 & 38.85 & 9.73 \\
\rowcolor{lightgray} EntProp~($k=0.2, n=5$) & 2.89 & 39.85 & \underline{10.69} \\
\rowcolor{lightgray} EntProp~($k=0.6, n=5$) & \textbf{3.29} & \textbf{40.78} & \textbf{10.94}\\
\bottomrule
\end{tabular}
\end{table}

\subsubsection{Other Architectures}
\Cref{tab:other_arch} shows experimental results for several architectures on the CIFAR-100 dataset.
Regardless of architecture, EntProp~($k=0.6, n=5$) consistently shows the highest $\mathrm{H_{score}}$.
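As a reading aid, the tabulated $\mathrm{H_{score}}$ values are consistent with the harmonic mean of SA and RA; we assume this definition here as a sketch rather than restating the formal one.
For example, for EntProp~($k=0.6, n=5$) with ResNet-18,
\begin{align*}
\mathrm{H_{score}} &= \frac{2 \cdot \mathrm{SA} \cdot \mathrm{RA}}{\mathrm{SA} + \mathrm{RA}} \\
&= \frac{2 \times 78.92 \times 57.16}{78.92 + 57.16} \approx 66.30.
\end{align*}
Because the harmonic mean is dominated by the smaller of the two accuracies, a method cannot obtain a high score by improving SA alone.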
% For further detail, we provide experiments of the ResNet and EfficientNet~\cite{efficientnet} families on the CUB-200-2011 dataset in the Appendix.

\begin{table*}[tb]
\centering
\caption{
Accuracy of ResNet-18, WideResNet-50, and ResNeXt-50 trained on the CIFAR-100 dataset.
Avg. indicates the average of the three networks.}
\label{tab:other_arch}
\resizebox{\textwidth}{!}{
\begin{tabular}{l|ccccccccc|ccc}
\toprule
 & \multicolumn{3}{c}{ResNet-18} & \multicolumn{3}{c}{WRN-50} & \multicolumn{3}{c}{ResNeXt-50} & \multicolumn{3}{c}{Avg.} \\
\cmidrule(r){2-4}
\cmidrule(r){5-7}
\cmidrule(r){8-10}
\cmidrule(r){11-13}
Method &  SA(\%) & RA(\%) & $\mathrm{H_{score}}$ &  SA(\%) & RA(\%) & $\mathrm{H_{score}}$ &  SA(\%) & RA(\%) & $\mathrm{H_{score}}$ &  SA(\%) & RA(\%) & $\mathrm{H_{score}}$ \\
\midrule
Vanilla & 78.45 & 49.96 & 61.04 & 79.35 & 51.64 & 62.56 & 80.86 & 52.95 & 63.99 & 79.55 & 51.52 & 62.53 \\
MixProp & \textbf{80.86} & 53.97 & 64.73 & \textbf{82.17} & 56.38 & 66.87 & \textbf{82.37} & 56.97 & 67.36 & \textbf{81.80} & 55.77 & 66.32 \\
Fast AdvProp & 78.89 & 53.31 & 63.63 & 79.69 & 55.25 & 65.25 & 79.30 & 55.31 & 65.17 & 79.29 & 54.62 & 64.68 \\
AdvProp & 75.15 & \underline{56.78} & 64.69 & 77.50 & \underline{59.28} & 67.17 & 78.36 & 59.08 & 67.37 & 77.00 & \underline{58.38} & 66.41 \\
\rowcolor{lightgray} EntProp~($k=0.2, n=1$) & \underline{79.41} & 55.24 & 65.15 & 80.66 & 57.30 & 67.00 & 81.46 & 58.47 & 68.08 & 80.51 & 57.00 & 66.74 \\
\rowcolor{lightgray} EntProp~($k=0.6, n=1$) & 78.89 & 55.86 & \underline{65.40} & \underline{81.3} & 58.95 & 68.34 & \underline{81.75} & \underline{59.28} & \underline{68.72} & \underline{80.65} & 58.03 & \underline{67.49} \\
\rowcolor{lightgray} EntProp~($k=0.2, n=5$) & 79.19 & 54.52 & 64.58 & 81.13 & 58.00 & \underline{67.64} & 81.20 & 58.95 & 68.31 & 80.51 & 57.16 & 66.84 \\
\rowcolor{lightgray} EntProp~($k=0.6, n=5$) & 78.92 & \textbf{57.16} & \textbf{66.30} & 80.77 & \textbf{61.02} & \textbf{69.52} & 81.38 & \textbf{61.35} & \textbf{69.96} & 80.36 & \textbf{59.84} & \textbf{68.59} \\
\bottomrule
\end{tabular}
}
\end{table*}

% \subsubsection{ViT, MAE, GPaCo}
Furthermore, we investigated the applicability of EntProp to vision transformers (ViT).
We experimented with fine-tuning ViT-Base pre-trained by MAE~\citep{mae} on the CIFAR-100 dataset. 
When applying EntProp to ViT, we add an auxiliary layer normalization layer instead of an auxiliary BN layer.
We show the results in \Cref{tab:vit}.
The results show that EntProp improves the RA of ViT by 4.77 points (70.62 to 75.39) while leaving SA essentially unchanged (89.55 to 89.56).
Thus, EntProp can be applied to ViT-based architectures and can be combined with recent methods that improve SA and RA, such as MAE.

\begin{table}[tb]
\centering
\caption{
Accuracy of ViT-Base pre-trained by MAE and fine-tuned on the CIFAR-100 dataset.
}
\label{tab:vit}
\begin{tabular}{l|ccc}
\toprule
Method & SA(\%) & RA(\%) & $\mathrm{H_{score}}$ \\ \midrule
Vanilla & 89.55 & 70.62 & 78.97 \\
% GPaCo & \textbf{89.73} & 70.23 & 78.79 \\
\rowcolor{lightgray} EntProp~($k=0.5, n=5$) & \textbf{89.56} & \textbf{75.39} & \textbf{81.87} \\
% \rowcolor{lightgray} GPaCo w/EntProp & 89.37 & 70.92 & 79.08 \\ 
\bottomrule
\end{tabular}
\end{table}


\subsubsection{Verification of Hypothesis}
To verify our hypothesis, we measured the Fr\'echet Inception Distance~(FID), which quantifies the distance between the feature distributions of two datasets.
We used Vanilla-trained ResNet-18 on the CIFAR-100 dataset.
We report the FID between the original dataset and the datasets generated by three transformations: MixUp, PGD attack, and Ours (Sample selection $+$ PGD attack $+$ MixUp).
\Cref{tab:fid} shows the results.
Ours yields the largest FID, \ie, it pushes the transformed distribution farthest from the original distribution.
This result supports our hypothesis that transforming samples to increase entropy generates out-of-distribution samples.
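For reference, the standard form of FID between two feature sets is sketched below.
The symbols are introduced here only for illustration: $\bm{\mu}_1, \bm{\mu}_2$ and $\bm{\Sigma}_1, \bm{\Sigma}_2$ denote the feature means and covariances of the original and transformed datasets, and we assume the features are taken from the Vanilla-trained ResNet-18 rather than from an Inception network.
\begin{align*}
\mathrm{FID} ={}& \lVert \bm{\mu}_1 - \bm{\mu}_2 \rVert_2^2 \\
&+ \mathrm{Tr}\bigl(\bm{\Sigma}_1 + \bm{\Sigma}_2 - 2(\bm{\Sigma}_1 \bm{\Sigma}_2)^{1/2}\bigr).
\end{align*}
The value is zero when both feature distributions share the same mean and covariance and grows as they move apart.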

\begin{table}[tb]
\centering
\caption{
FID for the original and transformed datasets.
We measured FID using Vanilla-trained ResNet-18 on the CIFAR-100 dataset.
}
\label{tab:fid}
\begin{tabular}{l|c}
\toprule
Transformation & FID \\ \midrule
MixUp & 5.12 \\
PGD attack & \underline{373.93} \\
\rowcolor{lightgray} Ours & \textbf{383.89} \\
\bottomrule
\end{tabular}
\end{table}

\subsubsection{Ablation Study}

First, we disable data augmentation and free adversarial training to isolate the effect of feeding only clean high-entropy samples to the ABN-applied network.
\Cref{fig:ent_vs_rnd} shows the results of sample selection with high entropy versus random selection.
The two strategies differ little when $k$ is very small or very large, whereas entropy-based selection shows a higher $\mathrm{H_{score}}$ than random selection for $0.2 \leq k \leq 0.7$.
Furthermore, $k\geq0.1$ shows a higher $\mathrm{H_{score}}$ than vanilla training ($k=0$), meaning that the use of ABN is effective.
The use of ABN increases the number of network parameters during training and allows the network to achieve good generalization performance.

\begin{figure}[tb]
\centering
\includegraphics[width=1.0\linewidth]{figure/EntProp_ent_vs_rnd.pdf}
\caption{
Comparison of high-entropy sample selection to random selection using ResNet-18 on the CIFAR-100 dataset.
Error bars indicate one standard error, and lines indicate the average.
$k=0$ is the same as vanilla training, and $k=1$ feeds all samples to the ABN-applied network.
}
\label{fig:ent_vs_rnd}
\end{figure}

Next, we verified each component of EntProp to confirm the effect of increasing entropy.
We set $k=0.2$ and $n=1$.
\Cref{tab:ablation} shows the results.
Training on clean high-entropy samples with the ABN-applied network improves both SA and RA over vanilla training even though no additional processing, such as adversarial attacks, is performed.
Adding MixUp further improves both SA and RA, while adding free adversarial training further improves RA but slightly decreases SA relative to sample selection alone.
EntProp, which uses all components, achieves the highest $\mathrm{H_{score}}$.
Increasing entropy brings the sample further away from the in-distribution domain, allowing effective disentangled learning with mixture distribution.
Moreover, \Cref{fig:entprop_ent_epochs} shows the entropy of the clean and transformed samples when training the network with EntProp.
The results show that EntProp~($k=0.2, n=5$) completely separates the domains of clean and transformed samples, as we hypothesized.

\begin{table}[tb!]
\centering
\caption{
Ablation study with ResNet-18 on the CIFAR-100 dataset.
% The best and second results are \textbf{bolded} and \underline{underlined}.
The numbers in parentheses indicate the differences from vanilla training.
}
\label{tab:ablation}
\resizebox{\linewidth}{!}{
\begin{tabular}{c|c|c|ccc}
\toprule
Sample Selection & MixUp & Free~($n=1$) & SA(\%) & RA(\%) & $\mathrm{H_{score}}$ \\ \midrule
\checkmark & &  & 79.24(\textcolor{green}{0.79}) & 51.17(\textcolor{green}{1.21}) & 62.18(\textcolor{green}{1.14}) \\
\checkmark &\checkmark &  & \textbf{79.66}(\textcolor{green}{1.21}) & \underline{54.53}(\textcolor{green}{4.57}) & \underline{64.74}(\textcolor{green}{3.70}) \\
\checkmark & & \checkmark & 78.55(\textcolor{green}{0.10}) & 52.99(\textcolor{green}{3.03}) & 63.29(\textcolor{green}{2.25}) \\
\checkmark &\checkmark & \checkmark & \underline{79.41}(\textcolor{green}{0.96}) & \textbf{55.24}(\textcolor{green}{5.28}) & \textbf{65.15}(\textcolor{green}{4.11}) \\ \bottomrule
\end{tabular}
}
\end{table}

\begin{figure}[tb]
\centering
\includegraphics[width=1.0\linewidth]{figure/EntProp_entropy_epochs.pdf}
\caption{
Entropy per epoch when ResNet-18 is trained with EntProp~($k=0.2, n=1$)~(left) and EntProp~($k=0.2, n=5$)~(right) on the CIFAR-100 dataset.
Error bars indicate one standard deviation, and lines indicate the average.
}
\label{fig:entprop_ent_epochs}
\end{figure}


\subsection{Detailed Experiments}
In this section, we provide a detailed analysis of the validity of EntProp's design.

\subsubsection{Uncertainty Metric}
We use entropy as a metric to select the samples that EntProp feeds to the ABN-applied network.
We evaluated EntProp~($k=0.2, n=1$) when using the following uncertainty metrics, in addition to entropy, to distinguish between samples in the in-distribution and out-of-distribution domains.
\begin{itemize}
% \item \textbf{Random} randomly determines the samples that EntProp feeds the ABN-applied network.
\item \textbf{Cross-Entropy} measures the discrepancy between the true probability distribution and the predicted probability distribution.
\item \textbf{Confidence} is the maximum class probability.
\item \textbf{Logit Margin} is the difference between the maximum non-true class probability and the true class probability.
\end{itemize}
Because we use MixUp during training, the true label used by these metrics is the original true label of the sample.
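The metrics can be written compactly as follows.
This is a sketch in notation introduced here for illustration, not taken from the definitions above: $p_c(x)$ is the predicted probability of class $c$ among $C$ classes, $y$ is the original true label, and the true distribution is assumed to be one-hot.
\begin{align*}
\text{Entropy:} \quad & -\sum_{c=1}^{C} p_c(x) \log p_c(x), \\
\text{Cross-Entropy:} \quad & -\log p_y(x), \\
\text{Confidence:} \quad & \max_{c} p_c(x), \\
\text{Logit Margin:} \quad & \max_{c \neq y} p_c(x) - p_y(x).
\end{align*}
Under this reading, higher uncertainty corresponds to larger entropy, cross-entropy, or margin and to smaller confidence, so we assume that the ranking is reversed for Confidence when selecting the fraction $k$ of samples.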
\Cref{tab:metric} shows the results.
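The metrics differ only slightly: the average $\mathrm{H_{score}}$ ranges from 66.26 to 66.63.
Moreover, no single metric is best for every architecture; Confidence is best for ResNet-18, ResNet-50, and WRN-50, whereas Logit Margin is best for ResNeXt-50.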


\begin{table}[tb]
\centering
\caption{
$\mathrm{H_{score}}$ for different uncertainty metrics on the CIFAR-100 dataset.
% The best and second results are \textbf{bolded} and \underline{underlined}.
}
\label{tab:metric}
\resizebox{\columnwidth}{!}{
\begin{tabular}{l|cccc|c}
\toprule
Metrics & ResNet-18 & ResNet-50 & WRN-50 & ResNeXt-50 & Avg. \\ \midrule
Entropy & \underline{65.15} & 65.93 & 67.00 & 68.08 & \underline{66.54} \\
% Random & 64.86 & 65.69 & \underline{67.15} & 67.84 & 66.39\\
Cross-Entropy & 64.81 & 64.76 & \underline{67.12} & \underline{68.34} & 66.26 \\
Confidence & \textbf{65.48} & \textbf{66.47} & \textbf{67.36} & 67.23 & \textbf{66.63}\\
Logit Margin & 64.84 & \underline{66.18} & 65.71 & \textbf{68.53} & 66.31\\
\bottomrule
\end{tabular}
}
\end{table}


\subsubsection{Design of Data Augmentation}
\label{sec:aug}
We compared MixUp and CutMix~\citep{cutmix} as data augmentations that increase entropy at no additional training cost.
CutMix replaces a part of an image with another image, so it has two labels, similar to MixUp.
\Cref{tab:data_aug} shows the results.
The results show that MixUp outperforms CutMix in RA (55.24 vs.\ 50.78) and $\mathrm{H_{score}}$ (65.15 vs.\ 62.54).
MixUp, which transforms the entire image, is more likely to increase entropy than CutMix, which transforms only a portion of the image; this larger entropy increase contributes to the higher $\mathrm{H_{score}}$.
On the other hand, CutMix shows higher SA than MixUp and the baseline methods (see \Cref{tab:other_arch}).
Therefore, we use MixUp when the goal is to improve $\mathrm{H_{score}}$ and CutMix when the goal is to improve SA.
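For reference, the two augmentations can be summarized in their standard form.
The notation is introduced here for illustration and follows the original MixUp and CutMix formulations rather than any setting specific to our experiments: $(x_i, y_i)$ and $(x_j, y_j)$ are two training samples with one-hot labels, $\lambda \in [0,1]$ is the mixing ratio, and $\bm{M}$ is a binary mask that is zero inside a rectangular region.
\begin{align*}
\text{MixUp:} \quad & \tilde{x} = \lambda x_i + (1-\lambda) x_j, \\
\text{CutMix:} \quad & \tilde{x} = \bm{M} \odot x_i + (\bm{1}-\bm{M}) \odot x_j, \\
\text{Both:} \quad & \tilde{y} = \lambda y_i + (1-\lambda) y_j.
\end{align*}
For CutMix, $\lambda$ equals the fraction of pixels kept from $x_i$.
MixUp changes every pixel whenever $0 < \lambda < 1$, whereas CutMix leaves the pixels outside the pasted region unchanged.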

\begin{table}[tb]
\centering
\small
\caption{
Accuracy for different data augmentations with ResNet-18 trained by EntProp~($k=0.2, n=1$) on the CIFAR-100 dataset.
}
\label{tab:data_aug}
\begin{tabular}{l|ccc}
\toprule
Data Augmentation & SA(\%) & RA(\%) & $\mathrm{H_{score}}$ \\ \midrule
MixUp & 79.41 & \textbf{55.24} & \textbf{65.15} \\
CutMix & \textbf{81.39} & 50.78 & 62.54 \\ \bottomrule
\end{tabular}
\end{table}

\subsubsection{Adversarial Attacks Other than PGD}
We investigated the influence of adversarial attacks other than PGD. 
\citet{advprop} experimented with the PGD variants GD and I-FGSM and concluded that the type of attack has no effect on performance.
We experimented with C\&W~\citep{candw} and TRADES~\citep{trades}, which are different types of attacks from PGD. 
% TRADES is a typical attack for adversarial training and uses cross-entropy loss and KL divergence loss. 
We trained ResNet-50 on the CIFAR-100 dataset with EntProp $(k=0.6, n=5)$ using different attacks. 
We show the results in \Cref{tab:other_atk}.
Both C\&W and TRADES are inferior to PGD in SA and RA.
C\&W is designed to induce misclassification, and TRADES is designed to balance accuracy and adversarial robustness; both have a smaller effect on increasing entropy than PGD.
Thus, this result confirms the validity of our method design, which uses transformations that increase entropy.
In addition, we verified the effectiveness of free adversarial training with other kinds of attacks.
We experimented with introducing free adversarial training into TRADES, an adversarial attack that uses a KL divergence loss in addition to the cross-entropy loss.
As a result, free adversarial training slightly reduces accuracy, but also reduces the training cost by N.
For TRADES, free adversarial training using only cross-entropy is not optimal, but it is effective enough in our research context. 
Thus, free adversarial training generalizes across different kinds of adversarial attacks.
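For context, the attacks differ mainly in the loss they maximize.
A minimal sketch in standard notation, introduced here as an assumption, is given below: $f$ is the network output distribution, $\alpha$ is the step size, and $\Pi$ is the projection onto the $\ell_\infty$ ball of radius $\epsilon$ around the clean sample $x$.
\begin{align*}
x^{t+1} &= \Pi\bigl(x^{t} + \alpha \, \mathrm{sign}(\nabla_{x^{t}} \mathcal{L}(x^{t}))\bigr), \\
\mathcal{L}_{\mathrm{PGD}}(x') &= \mathrm{CE}\bigl(f(x'), y\bigr), \\
\mathcal{L}_{\mathrm{TRADES}}(x') &= \mathrm{KL}\bigl(f(x) \,\|\, f(x')\bigr).
\end{align*}
In this form, PGD directly raises the cross-entropy of the perturbed sample, whereas the TRADES attack maximizes only the divergence from the prediction on the clean sample.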


\begin{table}[tb]
\centering
\caption{
Accuracy of ResNet-50 on the CIFAR-100 dataset when changing adversarial attacks used by EntProp.
}
\label{tab:other_atk}
\begin{tabular}{l|ccc}
\toprule
Adversarial Attack & SA(\%) & RA(\%) & $\mathrm{H_{score}}$  \\ \midrule
PGD & \textbf{80.62} & \textbf{60.50} & \textbf{69.12} \\
C\&W & 79.93 & 55.86 & 65.76 \\
TRADES & \underline{80.07} & \underline{58.25} & \underline{67.44} \\
TRADES w/Free & 79.81 & 58.13 & 67.27 \\
\bottomrule
\end{tabular}
\end{table}

\subsubsection{Sample Selection Bias}
\label{sec:bias}
We verified the bias of high-entropy sample selection during training.
\Cref{fig:bias} shows the results.
At $k=0.2$, the bias is large and most samples are not selected as high-entropy samples.
MixUp substantially reduces this selection bias.
The bias also decreases as $k$ increases, and MixUp reduces it further.


\begin{figure}[tb]
\centering
\resizebox{\linewidth}{!}{
\begin{tabular}{cc}
\includegraphics[width=0.5\linewidth]{figure/EntProp02K0_index.pdf} &
\includegraphics[width=0.5\linewidth]{figure/EntProp02K1e1MixUp_index.pdf}  \\
(a) & (b) \\
\includegraphics[width=0.5\linewidth]{figure/EntProp04K0_index.pdf} &
\includegraphics[width=0.5\linewidth]{figure/EntProp04K1e1MixUp_index.pdf}  \\
(c) & (d) \\
\includegraphics[width=0.5\linewidth]{figure/EntProp06K0_index.pdf} &
\includegraphics[width=0.5\linewidth]{figure/EntProp06K1e1MixUp_index.pdf}  \\
(e) & (f)
\end{tabular}
}
\caption{
Histogram of the number of times a sample was selected as a high-entropy sample.
The vertical axis is the frequency, and the horizontal axis is the number of times each sample was selected as a high-entropy sample.
We trained ResNet-18 on the CIFAR-100 dataset with different values of $k$, with and without MixUp.
(a) $k=0.2$.
(b) $k=0.2$ w/MixUp.
(c) $k=0.4$.
(d) $k=0.4$ w/MixUp.
(e) $k=0.6$.
(f) $k=0.6$ w/MixUp.
}
\label{fig:bias}
\end{figure}

% \subsubsection{Hyperparameter Analysis}
\subsubsection{How to Determine Hyperparameters}
EntProp has two hyperparameters: $k$, which determines the fraction of samples fed to the ABN-applied network, and $n$, which is the number of iterations of the PGD attack.
These values are determined based on the computational budget or on the validation accuracy.
As shown in the results of the main experiments, larger $k$ and $n$ do not always give the best results.
We provide the accuracies for varying $k$ and $n$ in the Appendix.
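As a rough guide for budgeting, the costs reported in \Cref{tab:comparison} are consistent with a training cost of $(1 + kn)N$, where $N$ is the cost of vanilla training.
This expression is our reading of the table rather than a formula stated elsewhere, and it assumes that only the selected fraction $k$ of samples incurs $n$ additional passes.
\begin{align*}
(k, n) = (0.2, 1) &: \; (1 + 0.2 \times 1)N = 1.2N, \\
(k, n) = (0.6, 1) &: \; (1 + 0.6 \times 1)N = 1.6N, \\
(k, n) = (0.2, 5) &: \; (1 + 0.2 \times 5)N = 2N, \\
(k, n) = (0.6, 5) &: \; (1 + 0.6 \times 5)N = 4N.
\end{align*}
Under this reading, a fixed budget can be spent either on a larger $k$ or on a larger $n$, and the validation accuracy decides between the two.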




\section{Limitation}
% The results in \Cref{tab:hypara} show that the higher $k$ of EntProp, the higher adversarial robustness.
In this paper, we focus on improving both standard accuracy and robustness against out-of-distribution domains.
We additionally evaluated the robustness against adversarial attacks.
Specifically, we evaluated the accuracy of EntProp variants and vanilla training under the PGD-20 attack.
\Cref{tab:pgd} shows the results.
Feeding clean high-entropy samples to the ABN-applied network shows higher adversarial robustness than vanilla training, even though adversarial attacks are not used for training.
Free adversarial training significantly improves adversarial robustness, but MixUp significantly decreases it.
% In the comparison of sample selection metrics, random shows the best results rather than using uncertainty metrics.
Comparison of sample selection metrics shows little difference in results across uncertainty metrics.
These results indicate that the components of EntProp designed around entropy are effective in improving standard accuracy and out-of-distribution robustness; however, the full method does not improve adversarial robustness over vanilla training (4.71 vs.\ 6.14).
If the target evaluation metric differs from ours, an appropriate selection metric other than entropy needs to be designed.

\begin{table}[tb]
\centering
\caption{
Adversarial robustness (accuracy (\%) under the PGD-20 attack) of ResNet-18 on the CIFAR-100 dataset.
}
\label{tab:pgd}
\resizebox{\linewidth}{!}{
\begin{tabular}{c|c|c|c|c}
\toprule
Sample Selection & MixUp & Free~($n=1$) & Metric & PGD-20 \\ \midrule
 & & & & 6.14 \\
\checkmark & &  & Entropy & 6.44 \\
\checkmark &\checkmark &  & Entropy & 4.14 \\
\checkmark & & \checkmark & Entropy & 10.51 \\
\checkmark &\checkmark & \checkmark & Entropy & 4.71 \\
\checkmark &\checkmark & \checkmark & Cross-Entropy & 4.45 \\
\checkmark &\checkmark & \checkmark & Confidence & 4.59 \\
\checkmark &\checkmark & \checkmark & Logit Margin & 4.42 \\
% \checkmark &\checkmark & \checkmark & Random & 5.24 \\
\bottomrule
\end{tabular}
}
\end{table}


\section{Conclusion}
% In this paper we propose a novel method, EntProp, for disentangled learning with mixture distribution.
Existing disentangled learning methods learn from a mixture distribution by treating clean and transformed samples as different domains, feeding the former to the MBN-applied network and the latter to the ABN-applied network.
However, it is not appropriate to treat the clean and transformed samples as different domains.
% because some samples are in the same domain and some are in different domains.
When we examined the domains of the samples based on entropy, we found that the domains of the clean and transformed samples overlap.
We hypothesize that further increasing the entropy of clean high-entropy samples generates samples that are further away from the in-distribution domain.
On the basis of the hypothesis, we propose a novel method, EntProp, which feeds high-entropy samples to the ABN-applied network.
% Therefore, we propose to select out-of-distribution domain samples based on entropy and feed only high-entropy samples to the ABN-applied network.
% Furthermore, we achieve further accuracy gains by introducing two techniques, data augmentation and free adversarial training, which bring the samples further away from the in-distribution domain without additional computational cost.
% PGD-5
Our experiments show that EntProp achieves high accuracy at a lower training cost than baseline methods.
In particular, experiments on the small datasets show that EntProp prevents the overfitting caused by adversarial training and outperforms the comparison methods.
Our method improves standard accuracy and out-of-distribution robustness, but has limitations with respect to adversarial robustness.
This limitation suggests the need to design an optimal domain selection metric for each task.

% References
\clearpage
\bibliography{main}

\newpage

\onecolumn

\title{EntProp: High Entropy Propagation for Improving Accuracy and Robustness\\(Supplementary Material)}
\maketitle



This document provides the supplementary material for ``EntProp: High Entropy Propagation for Improving Accuracy and Robustness''.
\appendix

\section{Dataset Details}

\paragraph{CIFAR-100.}
The CIFAR-100 dataset consists of 50,000 training images and 10,000 test images, with 100 classes.

\paragraph{CUB-200-2011.}
The CUB-200-2011 dataset consists of 5,994 training images and 5,794 test images, with 200 classes.

\paragraph{OxfordPets.}
The OxfordPets dataset consists of 3,669 training images and 3,680 test images, with 37 classes.

\paragraph{StanfordCars.}
The StanfordCars dataset consists of 8,144 training images and 8,041 test images, with 196 classes.

\paragraph{ImageNet.}
The ImageNet dataset consists of 1.3 million training images and 50,000 test images, with 1,000 classes.

\section{Comparison with GPaCo}
GPaCo~\citep{gpaco} is a loss function that improves both SA and RA.
We combined EntProp with GPaCo and fine-tuned ViT-Base pre-trained by MAE on the CIFAR-100 dataset.
\Cref{tab:gpaco} shows the results.
The results show that combining GPaCo with EntProp improves $\mathrm{H_{score}}$ (78.79 to 79.08) through higher RA, at the cost of slightly lower SA.
EntProp can be used in conjunction with recent methods to improve $\mathrm{H_{score}}$.

\begin{table}[tb]
\centering
\caption{
Accuracy of GPaCo and EntProp combinations.
We fine-tuned ViT-Base pre-trained by MAE on the CIFAR-100 dataset.
}
\label{tab:gpaco}
\begin{tabular}{l|ccc}
\toprule
Method & SA(\%) & RA(\%) & $\mathrm{H_{score}}$ \\ \midrule
% Vanilla & \underline{89.55} & \underline{70.62} & \underline{78.97} \\
GPaCo & \textbf{89.73} & 70.23 & 78.79 \\
\rowcolor{lightgray} GPaCo w/EntProp($k=0.5, n=5$) & 89.37 & \textbf{70.92} & \textbf{79.08} \\ 
\bottomrule
\end{tabular}
\end{table}

\section{Improvements Through MixUp}
The performance gain from MixUp is significant, but EntProp's performance does not come from MixUp alone.
We trained models with Fast AdvProp combined with MixUp on the CIFAR-100 dataset and compared them with EntProp.
We show the results in \Cref{tab:mixup_improve}.
In all models, EntProp outperforms Fast AdvProp in SA and RA.
The most significant difference between EntProp and Fast AdvProp is sample selection.
EntProp selects samples based on entropy, while Fast AdvProp selects samples randomly.
Entropy-based sample selection works well together with MixUp because it further amplifies the entropy increase caused by MixUp.
Thus, the improvement we claim comes from the combination of MixUp and entropy-based sample selection.

\begin{table}[tb]
\centering
\caption{
Comparison of performance gains from MixUp on the CIFAR-100 dataset.
}
\label{tab:mixup_improve}
\begin{tabular}{l|ccccccccc}
\toprule
 & \multicolumn{3}{c}{ResNet-18} & \multicolumn{3}{c}{WRN-50} & \multicolumn{3}{c}{ResNeXt-50} \\ 
\cmidrule(r){2-4}
\cmidrule(r){5-7}
\cmidrule(r){8-10}
Method & SA(\%) & RA(\%) & $\mathrm{H_{score}}$& SA(\%) & RA(\%) & $\mathrm{H_{score}}$& SA(\%) & RA(\%) & $\mathrm{H_{score}}$\\ \midrule
Fast AdvProp (w/MixUp) & 78.94 & 54.65 & 64.59 & 80.19 & 57.16 & 66.74 & 80.74 & 57.83 & 67.39 \\
\rowcolor{lightgray} EntProp $(k=0.2, n=1)$ & \textbf{79.41} & \textbf{55.24} & \textbf{65.15} & \textbf{80.66} & \textbf{57.30} & \textbf{67.00} & \textbf{81.46} & \textbf{58.47} & \textbf{68.08} \\ \bottomrule
\end{tabular}
\end{table}

\section{Combination of Fast AdvProp/AdvProp and EntProp}
EntProp (w/o Free Adversarial Training) can be combined with Fast AdvProp/AdvProp to improve accuracy.
We experimented with the combination of Fast AdvProp and EntProp ($k=0.2$) with ResNet-18 on the CIFAR-100 dataset. 
When combined with EntProp, Fast AdvProp uses entropy-based sample selection instead of random sample selection. 
We show the results in \Cref{tab:fa_ss}.
The combination with EntProp slightly increases the computational cost due to the overhead of entropy calculation, but also increases accuracy.
However, it is slightly inferior to EntProp~($k=0.2, n=1$), which costs only 1.2N (see \Cref{tab:other_arch}); this supports our method design.

Next, we experimented with the combination of AdvProp and EntProp ($k=0.6$) with ResNet-50 on the CIFAR-100 dataset. 
When combined with EntProp, AdvProp uses a plain PGD attack without Free Adversarial Training.
We show the results in \Cref{tab:advprop_ss}.
EntProp reduces the cost of AdvProp from 7N to 4.6N and improves $\mathrm{H_{score}}$ from 67.17 to 69.33.

\begin{table}[tb]
\centering
\caption{
Accuracy of Fast AdvProp and EntProp combinations.
We trained ResNet-18 on the CIFAR-100 dataset.
}
\label{tab:fa_ss}
\begin{tabular}{l|c|ccc}
\toprule
Method & Cost & SA(\%) & RA(\%) & $\mathrm{H_{score}}$\\ \midrule
Vanilla & N & 78.45 & 49.96 & 61.04 \\
Fast AdvProp & 1.2N & \underline{78.89} & \underline{53.31} & \underline{63.63} \\
\rowcolor{lightgray} Fast AdvProp w/EntProp ($k=0.2$) & 1.4N & \textbf{79.39} & \textbf{55.20} & \textbf{65.12} \\ \bottomrule
\end{tabular}
\end{table}

\begin{table}[tb]
\centering
\caption{
Accuracy of AdvProp and EntProp combinations.
We trained ResNet-50 on the CIFAR-100 dataset.
}
\label{tab:advprop_ss}
\begin{tabular}{l|c|ccc}
\toprule
Method & Cost & SA(\%) & RA(\%) & $\mathrm{H_{score}}$ \\ \midrule
Vanilla & N & \underline{79.30} & 51.01 & 62.08 \\
AdvProp & 7N & 78.05 & \underline{58.94} & \underline{67.17} \\
\rowcolor{lightgray}AdvProp w/EntProp ($k=0.6$) & 4.6N & \textbf{80.85} & \textbf{60.68} & \textbf{69.33} \\ \bottomrule
\end{tabular}
\end{table}

\section{Entropy per Epoch of EntProp Variants}
\Cref{fig:entprop_ent_epochs_ablation} shows the entropy of the clean and transformed samples when training the network with EntProp variants.
The results show that both techniques, MixUp and free adversarial training, increase the entropy of the samples.

\begin{figure}[tb]
\centering
\includegraphics[width=0.8\linewidth]{figure/EntProp_entropy_epochs_ablation.pdf}
\caption{
Entropy per epoch when ResNet-18 is trained with EntProp~(w/o MixUp, w/o Free adversarial training)~(left), EntProp~(w/o Free adversarial training)~(center), and EntProp~(w/o MixUp)~(right) on the CIFAR-100 dataset.
Error bars indicate one standard deviation, and lines indicate the average.
}
\label{fig:entprop_ent_epochs_ablation}
\end{figure}




\section{Hyperparameter Sensitivity}
We evaluated the relationship between the hyperparameters of EntProp, $k$ and $n$, and accuracy.
\Cref{tab:hypara_k,tab:hypara_n} show the results.
EntProp with $k=0.6$ shows the best $\mathrm{H_{score}}$, exceeding even $k=1.0$, which feeds all samples to ABNs.
However, a larger $k$ tends to show higher adversarial robustness, with $k=1.0$ showing the best result.
Feeding all samples to ABNs leads to overfitting to adversarial examples.
To improve robustness against out-of-distribution domains, it is effective to feed ABNs with carefully selected samples.
EntProp shows the highest $\mathrm{H_{score}}$ when the number of iterations $n$ of the PGD attack is 4.
The optimal $n$ depends on the size of the network and dataset.

\begin{table}[tb]
\centering
\caption{
Hyperparameter $k$ sensitivity study using ResNet-18 on the CIFAR-100 dataset.
}
\label{tab:hypara_k}
\begin{tabular}{c|cccc}
\toprule
$k$ & SA(\%) & RA(\%) & $\mathrm{H_{score}}$ & PGD-20 \\ \midrule
0 & 78.45 & 49.96 & 61.04 & 6.14\\
0.1 & 79.15 & 54.30 & 64.41 & 3.89 \\
0.2 & 79.41 & 55.24 & 65.15 & 4.71 \\
0.3 & \textbf{79.55} & 55.42 & 65.32 & 5.14 \\
0.4 & 78.90 & 55.55 & 65.20 & 5.14 \\
0.5 & 79.28 & 56.41 & 65.92 & 6.17 \\
0.6 & \underline{79.52} & \textbf{56.66} & \textbf{66.17} & 6.09 \\
0.7 & 79.12 & 56.04 & 65.61 & 5.96 \\
0.8 & 79.24 & 56.36 & 65.87 & 6.58 \\
0.9 & 78.95 & 56.20 & 65.66 & \underline{6.82} \\
1.0 & 79.44 & \underline{56.61} & \underline{66.11} & \textbf{7.25} \\
\bottomrule
\end{tabular}
\end{table}

\begin{table}[tb]
\centering
\caption{
Hyperparameter $n$ sensitivity study using ResNet-18 on the CIFAR-100 dataset.
}
\label{tab:hypara_n}
\begin{tabular}{c|ccc}
\toprule
$n$ & SA(\%) & RA(\%) & $\mathrm{H_{score}}$ \\ \midrule
1 & 78.89 & 55.86 & 65.40 \\
2 & \textbf{79.54} & \underline{57.46} & \underline{66.72} \\
3 & 79.15 & 57.25 & 66.44 \\
4 & \underline{79.40} & \textbf{58.04} & \textbf{67.06} \\
5 & 78.92 & 57.16 & 66.30 \\
\bottomrule
\end{tabular}
\end{table}

\end{document}
