% \documentclass{uai2022} % for initial submission
\documentclass[accepted]{uai2022} 
%% In your camera-ready you should use the 'accepted' parameter. This shows the authors and how an accepted paper will look like. The footer is 'Acccepted for X'. In the final version, the proceedings chairs will add the page numbers for PMLR and the final footer will be 'Proceedings of X'.
%
%% There is a class option to choose the math font
% \documentclass[mathfont=ptmx]{uai2022} % ptmx math instead of Computer
                                         % Modern (has noticable issues)
% \documentclass[mathfont=newtx]{uai2022} % newtx fonts (improves upon
                                          % ptmx; less tested, no support)
% NOTE: Only keep *one* line above as appropriate, as it will be replaced
%       automatically for papers to be published. Do not make any other
%       change above this note for an accepted version.

%% Choose your variant of English; be consistent
\usepackage[american]{babel}
% \usepackage[british]{babel}

%% Some suggested packages, as needed:
\usepackage{natbib} % has a nice set of citation styles and commands
    \bibliographystyle{plainnat}
    \renewcommand{\bibsection}{\subsubsection*{References}}
\usepackage{mathtools} % amsmath with fixes and additions
% \usepackage{siunitx} % for proper typesetting of numbers and units
\usepackage{booktabs} % commands to create good-looking tables
\usepackage{tikz} % nice language for creating drawings and diagrams

\usepackage{graphicx}
\usepackage{amsmath}
\usepackage{amssymb}
\usepackage{multirow}
\usepackage{lipsum}


%% Provided macros
% \smaller: Because the class footnote size is essentially LaTeX's \small,
%           redefining \footnotesize, we provide the original \footnotesize
%           using this macro.
%           (Use only sparingly, e.g., in drawings, as it is quite small.)

%% Self-defined macros
\newcommand{\swap}[3][-]{#3#1#2} % just an example

\title{Cyclic Test Time Augmentation with Entropy Weight Method}

% The standard author block has changed for UAI 2022 to provide
% more space for long author lists and allow for complex affiliations
%
% All author information is automatically removed by the class for the
% anonymous submission version of your paper, so you can already add your
% information below.
% 
% Important: in case of equal contributions, we strongly recommend to NOT show it in this part of the paper, but rather describe it in the appropriate section at the end of the paper "Author Contribution", where you have more space to describe how each author contributed.
%
% Add authors
% Remember to use the order convention "First/Given name" "Last/Family name", e.g. John Smith, Hanako Yamada, Marco Rossi, Wei Zhang
% \author[1]{\href{mailto:<jj@example.edu>?Subject=Your UAI 2022 paper}{Jane~J.~von~O'L\'opez}{}}
% \author[1]{\href{mailto:<jj@example.edu>?Subject=Your UAI 2022 paper}{Sewhan ChunJane~J.~von~O'L\'opez}{}}

\author[1]{Sewhan Chun}
\author[2]{Jae Young Lee}
\author[2]{Junmo Kim}
% \author[1]{Further~Coauthor}
% \author[3]{Further~Coauthor}
% \author[3,1]{Further~Coauthor}
% Add affiliations after the authors
\affil[1]{%
    NAVER CLOVA\\
    Republic of Korea
}
\affil[2]{%
    SIIT Lab\\
    KAIST\\
    Republic of Korea
}
% \affil[1]{%
%     Computer Science Dept.\\
%     Cranberry University\\
%     Pittsburgh, Pennsylvania, USA
% }
% \affil[3]{%
%     Another Affiliation\\
%     Address\\
%     …
%   }
  
  \begin{document}
\maketitle

\begin{abstract}
In recent studies of data augmentation for neural networks, test time augmentation has been studied as a way to extract optimal transformation policies that enhance performance at minimum cost. The policy search method with the highest level of input data dependency trains a loss predictor network to estimate suitable transformations for each input image independently, resulting in instance-level transformation extraction. In this work, we propose a method that utilizes and modifies the loss prediction pipeline to further improve performance through a cyclic search for suitable transformations and the use of the entropy weight method. The cyclic usage of the loss predictor allows each input image to be refined with multiple transformations of more flexible magnitude. For cases where multiple augmentations are generated, we implement the entropy weight method to reflect the data uncertainty of each augmentation, forcing the final result to focus on augmentations with low uncertainty. The experimental results show convincing qualitative outcomes and robust performance under corrupted data conditions.
\end{abstract}

\section{Introduction}
\label{sec:intro}
Test time augmentation (TTA) is a branch of data augmentation in which an input image is transformed into several different forms of itself for neural network prediction at test time. This generates multiple softmax outputs, which can be integrated by averaging them into a single final output. Such a method is known to yield more robust and better performance from neural networks \citep{Alexnet,TTA_Uncertainty_pitfall}. Conventionally, the transformations to use are set heuristically at the global level (\emph{i.e.} performing the same types of augmentation on all input data) for the domain.

However, conventional TTA has limitations. The major concern is the cost. A TTA policy refers to the scheme that specifies how many augmentations are used, with which transformations, and at what magnitude for each augmentation \citep{TTA-Policy_GPS}. While increasing the number of augmentations in the policy usually results in better performance, the cost increases multiplicatively. Because of such poor cost efficiency, many TTA applications are found in tasks where accuracy plays an important role, such as artificial intelligence competitions and medical or biological image processing \citep{Alexnet,TTA_med_1,TTA_med_2,TTA_med_3}.

Another concern is the inflexibility of the policy. A suitable policy should maintain intra-class invariance (\emph{i.e.} invariance of the label under transformation) and inter-class distinctiveness (\emph{i.e.} ability to maintain distinctive features to distinguish between classes) of input data to the model \citep{TTA-Policy_APAC,TTA-Policy_WhenAndWhy}. In the conventional scheme, however, the policy is found heuristically and applied at the global level, which makes establishing a suitable policy difficult. For example, a horizontal flip is known to be a common and effective TTA transformation with intra-class invariance and inter-class distinctiveness for most images from the ImageNet dataset \citep{Alexnet,Dataset_Imagenet}. However, in the MNIST dataset \citep{Dataset_MNIST}, while visually symmetric digits (\emph{e.g.} ``1'', ``8'') could tolerate such a transformation, orientation-sensitive digits (\emph{e.g.} ``7'', ``6'', ``2'', ``5'') could lose their intra-class invariance and inter-class distinctiveness when flipped, losing the features needed to classify them as their original labels.

To overcome such limitations, trainable TTA policy search methods were introduced. These approaches formulate the search for the most suitable TTA policy as an optimization problem, finding the most helpful augmentations among various candidate transformations and their magnitudes. In Greedy Policy Search (GPS) \citep{TTA-Policy_GPS}, multiple augmentations can be generated in the policy, where each augmentation is regarded as a sub-policy that can consist of multiple transformations with corresponding magnitudes. While GPS has a global-level TTA scheme, some studies aim to find more specific levels of data dependency for the TTA policy, namely class-level and instance-level (\emph{i.e.} applying transformations to the input image depending on the condition of each individual input).

Trainable TTA policies have also contributed to the robustness of neural network prediction. Despite the promising performance of neural networks, studies have shown that they can be vulnerable to perturbations or corruptions in data \citep{Robustness_Adversarial,Robustness_Imagenet-C}. Many data augmentation methods have achieved strong robustness against such damage \citep{DataAugmentation_augmix,DataAugmentation_autoaugment, DataAugmentation_fastautoaugment}. Previous works \citep{TTA-Policy_L2T,TTA-Policy_GPS} showed that TTA can also improve robustness. With a suitable TTA policy, corruption in an image can be suppressed by modifying the test image directly via suitable transformations. Kim et al. \citep{TTA-Policy_L2T} recently introduced the first instance-level TTA policy search method, in which the transformation to apply is determined by the condition of each input image. By applying a loss predictor, their work achieved improved robustness with only a small amount of additional computation.

In this work, we introduce cyclic TTA with the entropy weight method (EWM) for the classification task, applying multiple transformations to each image and reflecting uncertainty directly in the prediction result of each augmentation. Following the view that instance-level TTA is the effective level of data dependency, we believe there is still room for improvement in the loss prediction pipeline \citep{TTA-Policy_L2T} in terms of flexibility. With iterative use of the loss predictor, each image can be assigned multiple transformations with more flexible magnitudes. For the multiple augmentations case, we also introduce a modified EWM to attenuate softmax outputs with high data uncertainty. Because the cost of calculating the entropy is relatively minor, the EWM can easily be adopted to improve the robustness of network prediction.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

%-------------------------------------------------------------------------
\section{Related Works}
\label{sec: related works}

%\subsection{Test Time Augmentation}
{\bf Test time augmentation:} TTA has long been used for neural network prediction. Many landmark neural network results on the ImageNet dataset \citep{Dataset_Imagenet} used TTA \citep{Alexnet,GoogleNet,VGG,Resnet}, with augmentations consisting of numerous patches cropped from the original images. This improved accuracy at the price of a multiplicative increase in cost. Because TTA can directly modify the data at test time, it has also been studied for further uses, ranging from uncertainty estimation to data distillation \citep{TTA_Uncertainty_med1, TTA_Uncertainty_med2, Data_Distillation}. TTA policy search is one attempt to address the cost limitation and to further improve the effectiveness of TTA. Sato et al. \citep{TTA-Policy_APAC} were among the first to analyze TTA policy, building an optimal decision rule to improve generalization. GPS \citep{TTA-Policy_GPS} was introduced as a learnable TTA policy search method that greedily builds a global-level policy. GPS showed excellent improvement in accuracy and robustness, applying multiple transformations with flexible magnitudes in each sub-policy. Shanmugam et al. \citep{TTA-Policy_WhenAndWhy} proposed a TTA policy with class-level data dependency. Their work trains a set of parameters to learn the relation between each class and each augmentation, and uses it as a post-processing method to extract one final prediction from the multiple predictions. Recently, Kim et al. \citep{TTA-Policy_L2T} proposed a TTA policy with instance-level data dependency. They trained a loss predictor to predict which transformation would be suitable for the target network (\emph{i.e.} the main classifier used for the task). Their contribution is that such a pipeline is very cost-efficient: even a single suitable augmentation can increase the robustness effectively.
However, unlike GPS, such a pipeline cannot apply multiple transformations with flexible magnitudes to a single image, since each sub-policy takes a single transformation from a set of predefined transformations.

%\subsection{Robustness to Corruption}
{\bf Robustness to Corruption:} While modern neural networks achieve high performance, even exceeding human capabilities, many studies show that they can easily fail under corruptions and perturbations that arise from various sources in real-life deployments \citep{Robustness_Adversarial,Robustness_Imagenet-C}. Hendrycks et al. \citep{Robustness_Imagenet-C} introduced a corruption benchmark built on ImageNet data, namely ImageNet-C, simulating 19 different types of corruption for evaluating network robustness. Many data augmentation approaches \citep{DataAugmentation_augmix,DataAugmentation_autoaugment,DataAugmentation_fastautoaugment} were introduced to enhance robustness, resulting in significant improvements against various kinds of corruption.
 
%\subsection{Uncertainty Estimation}
{\bf Uncertainty Estimation:} Uncertainty estimation acts as an indicator of the confidence of a network prediction. Many applications in deep learning use the uncertainty to provide additional information alongside the final prediction \citep{Uncertainty_estimation}. In the field of active learning, where uncertain data are queried for labeling from a set of unlabeled data, a loss value can be used as a means of estimating the uncertainty. In the case of active learning for classification \citep{ActiveLearning_Lossprediction}, a separate loss predictor module can be trained to estimate the expected loss magnitude at much lower cost, providing a faster and more efficient way to find samples with high expected loss and thereby to query and select the uncertain unlabeled data. In this case, the loss value can be regarded as an indication of how uncertain the data is for the target network (the classifier), since samples with high loss values bring about relatively large changes to the neural network.

From another point of view, according to a previous study \citep{Uncertainty_Ensemble}, the overall uncertainty of a neural network prediction can be divided into knowledge uncertainty and data uncertainty. In this paper, we focus on the data uncertainty, the irreducible uncertainty due to the inherent complexity or noise of the data. In the classification task, data uncertainty can be calculated as the expected entropy of the softmax outputs. The expectation can be calculated by averaging the entropy values of multiple predictions (softmax outputs) made by multiple models for the same input.
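
As a sketch of this calculation, suppose $M$ models produce softmax outputs $p^{(1)},\dots,p^{(M)}$ over $n$ classes for the same input; the symbols $M$, $p^{(q)}$, $H$, and $U_{data}$ are introduced here only for illustration. The data uncertainty can then be approximated as
\begin{align*}
    U_{data} \approx \frac{1}{M}\sum_{q=1}^{M} H\big(p^{(q)}\big), \qquad H(p) = -\sum_{j=1}^{n} p_{j}\ln p_{j}\;.
\end{align*}
With a single model ($M=1$), this estimate reduces to the entropy of one softmax output.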


%\subsection{Entropy Weight Method}
{\bf Entropy Weight Method:} In the field of decision making, the EWM is used to reflect the degree of disorder of a system \citep{EWM_1,EWM_2}. Many studies in water quality assessment use the EWM to reflect the uncertainty among samples and to diminish the importance weights of uncertain assessment parameters. The weights indicate the importance of the parameters for the quality assessment and are calculated to be large for low entropy and small for high entropy. For example, a type of substance (\emph{i.e.} a parameter) detected in a uniform amount across the majority of samples would gain less weight than other parameters, due to the high entropy arising from the uniformity. We observe that, with some modification of the previous EWM, a network prediction can similarly be regarded as a sample in the decision-making setting. By reflecting the entropy in the network predictions, we seek to improve the robustness of the predictions at only a small extra cost for calculating the entropy.
%------------------------------------------------------------------------


\section{Method}
\label{sec:method}

% should I explain what conventional TTA stand as? as equation?




\begin{figure*}[t]
    \centering
    \includegraphics[scale=0.53]{images/loss_predictor_pipeline_overall.PNG}
    \caption{\label{figure:standard} Illustration of the loss prediction pipeline \citep{TTA-Policy_L2T}. (a) Loss prediction of which transformation $T$ to apply to the corrupted image of a ``chiton'' during testing. $\tau_{a,b}$ indicates the predefined transformation of type $a$ with magnitude $b$. (b) Training algorithm of the loss predictor $\theta_{LP}$. During training, an input image $x$ is transformed by each of the predefined transformations $\tau$, and the target network $\theta_{target}$ makes predictions on the results to produce the loss values $y_{loss}(\tau(x))$. After softmax normalization, these loss values are given to the loss predictor $\theta_{LP}$ as target values, and the loss predictor is trained on them with the Spearman correlation ranking loss \citep{sodeep}. The loss predictor takes the resized input image and learns the correlation between the condition of the downsized original image and the target network results on the transformed images.}
 \end{figure*}
 
Our method consists of a cyclic modification of the loss prediction pipeline and an implementation of the EWM. In section 3.1, we introduce our baseline, the previous loss prediction pipeline illustrated in Figure~\ref{figure:standard}, and the modifications made for our method. The cyclic application of the loss predictor is explained in section 3.2. The iterative transformations try to find an optimal condition for a given input image. Compared to the previous work, this application adds flexibility to the transformations in the TTA policy. The difference from the former method is illustrated in Figure~\ref{figure:comparison}. In section 3.3, the modifications and implementation of the EWM are explained. In the multiple augmentations case, where more than one augmentation is used in the TTA policy, we aim to reflect the data uncertainty of each augmentation. For uncertainty estimation, we follow the definition of uncertainty by Malinin et al. \citep{Uncertainty_Ensemble}, considering that the entropy of a softmax output can represent the data uncertainty (with the difference that we use only a single network prediction to calculate it).

\subsection{Loss Prediction for the Transformation Estimation}

% introduce baseline, modifications for our method.

% application
Kim et al. \citep{TTA-Policy_L2T} introduced an innovative loss prediction pipeline for instance-level image augmentation at test time. As presented in Figure~\ref{figure:standard}, a loss predictor aims to find, among the predefined transformations, a suitable transformation that prepares an input image for the target network (\emph{i.e.} the classifier). At test time, an input image is resized and evaluated by the loss predictor. The loss predictor predicts the expected loss of the target network prediction for each augmentation that the predefined set of transformations would produce. In other words, the loss predictor indicates which transformation would result in the best outcome for the target network, since the smallest predicted loss corresponds to the transformation that yields the best image condition. The transformation corresponding to the minimum predicted loss is selected as the top 1 choice for the sub-policy. In the case of a single augmentation, the pipeline passes an input image through the suitable transformation, and the classifier then predicts from the transformed image.
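
As a sketch of this selection rule, let $\mathcal{T}$ denote the set of predefined transformations, $r(\cdot)$ the resizing step, and $\hat{\ell}_{\tau}(\cdot)$ the loss that the loss predictor predicts for transformation $\tau$; this notation is introduced here only for illustration. The single-augmentation prediction can then be written as
\begin{align*}
    \tau^{*} = \operatorname*{arg\,min}_{\tau\in\mathcal{T}}\; \hat{\ell}_{\tau}\big(r(x)\big), \qquad
    \hat{y} = \operatorname*{arg\,max}_{j}\; p_{j}\big(\tau^{*}(x)\big),
\end{align*}
where $p_{j}(\cdot)$ stands for the softmax output of the target network for class $j$.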

% training
{\bf Training the loss predictor:} Training the loss predictor requires the target network to make predictions on an input image in multiple augmented forms, one for each predefined transformation. During training, the target network is frozen and only makes predictions. For each prediction, the cross-entropy loss of the corresponding augmented image is calculated. The loss values of the augmentations are softmax normalized and fed to the loss predictor as target values, and the loss predictor is trained on them with the Spearman correlation ranking loss \citep{sodeep}. Ultimately, the loss predictor learns to find which transformation results in the smallest loss when the image is evaluated by the target network. Since the prediction is made with the transformation expected to give the smallest loss, the input image has a better chance of being classified correctly. % should i say the loss predictor learns the features of the corruption???
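
As a sketch of the target construction, let $\ell_{1},\dots,\ell_{N}$ denote the cross-entropy losses of the frozen target network on the $N$ transformed versions of one input; the symbols $\ell_{i}$, $y_{i}$, and $N$ are introduced here only for illustration, and a standard softmax without temperature is assumed:
\begin{align*}
    y_{i} = \frac{e^{\ell_{i}}}{\sum_{l=1}^{N} e^{\ell_{l}}}, \qquad i=1,\dots,N\;.
\end{align*}
Because the softmax is monotonic, the ordering of the targets $y_{i}$ matches the ordering of the losses $\ell_{i}$, so the transformation with the smallest loss also has the smallest target value.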





% Extra information, cifar might require
%% We use the training data
For training the loss predictor on the ImageNet dataset, the training data used for the target network are reused. Training the loss predictor with a separate validation set would seem more suitable, since the loss values that the target network produces on its own training data do not perfectly simulate the actual test condition. Nevertheless, it has been reported that the two choices do not make much difference in performance. % After all, it implies that the role of the loss predictor would be to find which transformation would well suppress the corruption condition, not to directly classify the input data.

Additionally, in order to build robustness to corruptions, random sequences of corruptions from the previous study \citep{Robustness_Imagenet-C} were applied to the input images, simulating various real-life image conditions.


% Talk about the transformations. How and why they were used.
{\bf Loss predictor architecture:} For the network architecture of the loss predictor, EfficientNet-B0 \citep{EfficientNet} is used as the backbone. The architecture is modified to utilize multi-level features of the input, as in the active learning loss predictor \citep{ActiveLearning_Lossprediction}. The loss prediction pipeline is stated to be cost-efficient because the cost of loss prediction with such an architecture is negligible relative to that of the target networks used for classification \citep{TTA-Policy_L2T}. Downsizing the image to $64\times64$ pixels enables this cost efficiency and is also intended to let the loss predictor learn low-level features. % This indicates that the additional cost is almost ignorable the application of such loss prediction should not be a major issue 

{\bf Transformation candidates:} As for the predefined transformations, in our method we have modified the transformation magnitudes to allow more flexible outcomes. The types of transformation include: Identity, Rotation, Zoom, Auto Contrast, Blurring, Sharpening, and Color Saturation. Including the magnitude configurations for each transformation, our method comprises 12 different transformation candidates. We explain the details of the transformations in Appendix A. Overall, the loss predictor suggests the one of these transformations with the least expected loss value, and that transformation is then applied to prepare the image for the target network prediction.

%% corruption explanation

{\bf Multiple augmentations:} When $k>1$ augmentations are used, the top $k$ transformations suggested by the loss predictor are selected to generate the corresponding $k$ augmentations. When the EWM is not used, the classification results of the augmented images are integrated in the conventional manner, by averaging the softmax outputs.



\subsection{Cyclic TTA}

{\bf Cyclic loss prediction:} In contrast to the former study, our work uses the loss predictor in a cyclic manner, as shown in Figure~\ref{figure:comparison}. Once the image is transformed according to the prediction of the loss predictor, instead of being processed directly by the target network, the modified image is fed to the loss predictor again, forming a cycle. The image goes through the cycle repeatedly until the exit signal is activated. We set two conditions for activating the exit signal. The first is when the loss predictor suggests the identity transformation for the input image. This indicates that the image no longer requires additional transformations to reach a better condition, ideally implying that the image is in an optimal condition. The second is when the number of cyclic iterations reaches a predefined hyperparameter, the maximum number of iterations. Because our loss predictor is not perfect at predicting the suitable transformation, we limit the number of cycles the loss predictor iterates in order to prevent the rare case of an unbounded sequence of cyclic loss predictions. Such a simple modification expands the transformation space into a much larger set of possible combinations of the predefined transformations. Given that $T$ and $m$ refer to the number of transformation candidates and the maximum number of iterations, respectively, the size of the transformation space in our method can be written as $T^{m}-T^{m-1}+1$. While our baseline has $m=1$ and thus $T$ transformation possibilities, a larger $m$ in our method opens up more potential candidates for the input image to be transformed into.
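
The cycle can be sketched as the following recursion, where $x^{(t)}$ denotes the image after $t$ iterations, $\tau^{(t)}$ the transformation suggested at iteration $t$, $\mathcal{T}$ the set of predefined transformations, and $\hat{\ell}_{\tau}$ the predicted loss for transformation $\tau$; all of these symbols are introduced here only for illustration:
\begin{align*}
    x^{(0)} = x, \qquad
    \tau^{(t)} = \operatorname*{arg\,min}_{\tau\in\mathcal{T}}\; \hat{\ell}_{\tau}\big(x^{(t-1)}\big), \qquad
    x^{(t)} = \tau^{(t)}\big(x^{(t-1)}\big), \qquad t = 1, 2, \dots
\end{align*}
The recursion stops at the first $t$ for which $\tau^{(t)}$ is the identity transformation or $t=m$, and $x^{(t)}$ is then passed to the target network. As a worked instance of the transformation space expression with the $T=12$ candidates used in our method, $m=1$ gives $12^{1}-12^{0}+1=12$, and $m=2$ gives $12^{2}-12^{1}+1=133$: one outcome in which the identity is suggested immediately, plus $11$ non-identity first steps each followed by any of the $12$ candidates ($1+11\times 12=133$).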



%equation of transformation space expansion

\begin{figure*}
    \centering
    \includegraphics[scale=0.55]{images/loss_cyclic_predictor_pipeline_comparison.PNG}
    \caption {\label{figure:comparison}{\bf Top:} Comparison between the previous ({\bf left}) and the cyclic ({\bf right}) loss prediction pipelines. $T_t$ indicates the suggested transformation at iteration $t$. {\bf Bottom:} Expanded illustration of the cyclic loss prediction. The input image of a ``king snake'' is corrupted with snow corruption. The image goes through iterative loss prediction cycles until it meets the exit signal. $t_{\tau_{identity}}$ indicates the iteration at which the loss predictor suggests the identity transformation, which is an exit signal.} 
\end{figure*}


For a severely corrupted input image, a single transformation might not be sufficient to suppress the corruption. For example, if an image is corrupted by severe Gaussian noise, the former method would select and perform a blurring transformation to remove the noise. However, a residual noise component may remain, since the magnitude of the transformation is predefined and the transformation is performed only once. In contrast, cyclic iterations of transformation can keep trying to remove the noise until the loss predictor predicts that the condition of the image is well suited for classification. In this way, cyclic TTA can provide more flexible sequences of multiple transformation types as preprocessing for the task.

% RNP
Training the loss predictor for cyclic TTA involves applying multiple corruptions to the input data. With the input data corrupted multiple times, the loss predictor is trained, in a manner similar to the loss predictor of \citep{TTA-Policy_L2T}, to predict which transformation can suppress the corruptions and result in the least expected loss.

% to simulate cyclic TTA case, random iterative modification of simulation of t value are realized...?

{\bf Multiple augmentations:} When $k>1$ augmentations are to be used, we prepare $k$ copies of the original image to be processed. In the first iteration, $t=1$, the copies are transformed according to the top $k$ transformations from the loss prediction, respectively. From the second iteration on, unless the exit signal has been activated for that augmentation, each image follows the normal cyclic behavior, selecting the top 1 suggestion from its own loss prediction. In short, each of the $k$ augmentations starts with a different transformation at $t=1$ and proceeds through cyclic TTA independently. Ideally, if the loss predictions were very accurate, all $k$ transformed images would present similar features, assuming that there is only one optimal condition of the input image for classification. In the end, $k$ target network predictions are generated as softmax outputs. When the EWM is not used, these are averaged to obtain a final prediction for each input image.   % accuracy improvement by increasing k is not large for this reason?? give out some pictures?

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%% Saving point
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%


{\bf Cyclic TTA cost:} As previously mentioned, the cost of the loss prediction is small relative to that of the target network prediction. For example, our experiments on ImageNet involve a target network that takes 4.1 GFLOPs, whereas the loss predictor with a downsized input image requires only 2.6 MFLOPs. Although our cyclic loss prediction requires multiple iterations of loss prediction and transformation, the number of iterations is bounded by the maximum-iteration hyperparameter and the cost of each loss prediction is relatively small, so the pipeline retains a cost efficiency similar to that of our baseline.  % flops should be more suitable.  
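
As a rough worked estimate from the figures above, a single loss prediction costs
\begin{align*}
    \frac{2.6\times 10^{6}}{4.1\times 10^{9}} \approx 6.3\times 10^{-4}
\end{align*}
of one target network prediction, that is, about $0.06\%$. For a single augmentation with at most $m$ loss predictions, the loss prediction overhead is therefore at most $m \times 2.6$ MFLOPs; with a hypothetical $m=5$, chosen here only for illustration, this amounts to $13$ MFLOPs, or about $0.32\%$ of the 4.1 GFLOPs target network prediction. This estimate counts only the network forward passes and leaves out the cost of resizing and of applying the transformations themselves.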

% oracle
%To suggest the upper bound for the cyclic TTA, we show oracle  to the baseline,  



%In addition to identity transformation as the exit signal, we set an additional exit signal activation conditions, hyper parameter of the maximum number of iteration for the cyclic behavior. In our experiment, we have examined that the availability of an infinite number of iteration sometimes deteriorates the image, due to imperfect prediction of the loss predictor. We see that excessive transformation can rather corrupt the image condition. Thus, we have limited the number of transformations for each image could to perform. %Consider this to be in the discussion or more early stage of the writing....


\subsection{Entropy Weighted Summation} % Maybe I should state equation for conventional TTA here....
{\bf Average integration:} In the conventional case of using multiple augmentations for TTA, the softmax outputs are integrated by averaging them. For a classification task with $n$ classes and $m$ augmentations (in this subsection, $m$ denotes the number of augmentations, written as $k$ in the previous subsections), the conventional final prediction score for class $j$ ($j \le n$) is calculated as
\begin{align}
    \label{equation:conventional_integration}
    %p_{final_j} = \left(\tfrac{1}{m}\right )\cdot\sum_{i=1}^{m} p_{i,j}\;,
    p_{final_j} = \tfrac{1}{m}\cdot\sum_{i=1}^{m} p_{i,j}\;,
\end{align}

where $i$ represents the augmentation index and $p_{i,j}$ indicates the softmax output element of class $j$ from augmentation $i$. The final classification is then decided by choosing the class index $j$ with the maximum value of $p_{final_j}$. Such integration weights every prediction with the same importance. However, certain augmentations can be more important than others for producing a correct prediction \citep{TTA-Policy_WhenAndWhy}. For example, when multiple patches are cropped from an original input image, each augmentation has a different view. Certain patches might not contain essential features, since parts of the original image can be excluded by the cropping. In this case, giving these augmentations the same importance as others that carry more correct information could disturb the final prediction. Illustrations of such cases are presented in Appendix C.
% discussion: in our experiments, we saw that the average provide suitable solution, because (usually??????) either logit or softmax output scores produces less loss values? or higher score for correct label when the prediction is correct. in this way, incorrect prediction presents relatively small score for its maximum score for incorrect label. temperature calibration can help???? in appendix???

{\bf EWM integration:} Inspired by previous work in the field of decision making \citep{EWM_1,EWM_2,EWM_3}, we regard each prediction produced from an augmentation as a data sample whose probability distribution over the possible decisions carries an implicit uncertainty. We modify the previously established EWM so that it can be applied to neural network predictions. While EWM computes the entropy across data samples to derive the weight of each evaluation parameter, we compute the entropy $E_{i}$ of augmentation $i$ over its class probabilities as
\begin{align}
    \label{equation:modified_ewm}
    E_{i} = {-\sum_{j=1}^{n}p_{i,j}\cdot \ln p_{i,j}}\;\;.
\end{align}
The entropy is then used to compute the weight $w_{i}$ of each augmentation $i$ as the reciprocal of the softmax-normalized entropy:
\begin{align}
    \label{equation:ewm_weight}
    w_{i} = \left(\tfrac{e^{E_{i}}}{\sum_{k=1}^{m}e^{E_{k}}}\right)^{-1}.
\end{align}
By taking the reciprocal of the softmax of the entropies, each weight represents how certain the corresponding augmentation is about its prediction. The predictions from the $m$ augmentations are then integrated, and the final prediction score for class $j$ is calculated as
\begin{align}
    \label{equation:ewm_integration}
    p_{final_j} = \sum_{i=1}^{m}w_{i}\cdot p_{i,j}\;.
\end{align}
As in the conventional method, the final classification is the class index with the maximum value of $p_{final_j}$.
Considering the definition of data uncertainty by Malinin et al. \citep{Uncertainty_Ensemble}, the modified EWM can be regarded as reflecting the data uncertainty of each augmented input, placing more weight on less uncertain augmentations and less weight on more uncertain ones. Ideally, a more accurate estimate of data uncertainty involves more than one neural network. Nevertheless, our experiments show that reflecting the uncertainty in this way yields predictions that are more robust to corrupted data when multiple augmentations are used. % Ideally, the entropy should be calculated by averaging multiple entropy calculations from multiple predictions by ensemble, but that is not the case at the moment,,,, so,,,, haha.... maybe later...?
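As a toy illustration with hypothetical numbers (not taken from our experiments), consider $n=2$ classes and $m=3$ augmentations with softmax outputs $(0.9, 0.1)$, $(0.3, 0.7)$, and $(0.25, 0.75)$. The entropies are
\begin{align*}
    E_{1} \approx 0.325\;, \quad E_{2} \approx 0.611\;, \quad E_{3} \approx 0.562\;,
\end{align*}
and, with $\sum_{k=1}^{m} e^{E_{k}} \approx 4.981$, the weights are
\begin{align*}
    w_{1} \approx 3.599\;, \quad w_{2} \approx 2.704\;, \quad w_{3} \approx 2.839\;.
\end{align*}
The weighted scores then become
\begin{align*}
    p_{final_1} &= 0.9\,w_{1} + 0.3\,w_{2} + 0.25\,w_{3} \approx 4.760\;, \\
    p_{final_2} &= 0.1\,w_{1} + 0.7\,w_{2} + 0.75\,w_{3} \approx 4.382\;,
\end{align*}
so class 1 is selected, whereas plain averaging of the same outputs gives approximately $(0.483, 0.517)$ and selects class 2. Note that the reciprocal of the softmax equals $w_{i} = e^{-E_{i}}\sum_{k=1}^{m} e^{E_{k}}$, so the weights do not sum to one. The scores are therefore not probabilities, but the selected class is unaffected because every score shares the same positive scale factor.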

%Application of EWM could refer to further examining the uncertainty after the loss prediction. While the cyclic loss prediction directly transforms the data, EWM can further examine the confidence of each sample and weigh more importance on samples with more confidence. Due to the data change from the original form by the cyclic loss prediction pipeline, further examination of the uncertainty by EWM can reflect the ambiguity suites well in such case to reflect the ambiguity of each cyclically transformed input, as in case of water quality assessment from our related works, where water sample from same area can exhibit varying characteristics.


%------------------------------------------------------------------------


\section{Experiments}
\label{sec:experiments}

\subsection{ImageNet Classification}

We evaluate the effect of the cyclic behavior of the loss predictor and of EWM on the ILSVRC 2012 dataset \citep{Dataset_Imagenet}. ImageNet contains 1.2 million images from 1000 classes of real-life objects. In addition to the clean data, we also evaluate our method on the ImageNet-C dataset \citep{Robustness_Imagenet-C}, in which various types of corruption are simulated at 5 severity levels. ImageNet-C includes 19 types of algorithmically generated corruptions from the noise, blur, weather, digital, and extra categories. While the standard error rate is used for evaluation on clean data, the mean corruption error ($mCE$) metric \citep{Robustness_Imagenet-C} is used to evaluate the robustness of the network. Overall, a single evaluation of $mCE$ uses 50,000 (the ImageNet validation set size) $\times$ 5 $\times$ 19 samples of size 224 $\times$ 224. % more explanation on mCE refering to the performance to the Alexnet??
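For reference, the $mCE$ of \citep{Robustness_Imagenet-C}, as we understand it, normalizes the error of the evaluated network by that of AlexNet for each corruption type before averaging. Writing $\mathrm{Err}^{f}_{s,c}$ for the top-1 error of network $f$ on corruption type $c$ at severity $s$ (notation introduced here only for illustration), the metric can be sketched as
\begin{align*}
    CE^{f}_{c} &= \frac{\sum_{s=1}^{5}\mathrm{Err}^{f}_{s,c}}{\sum_{s=1}^{5}\mathrm{Err}^{\mathrm{AlexNet}}_{s,c}}\;, \\
    mCE &= \frac{1}{C}\sum_{c=1}^{C} CE^{f}_{c}\;,
\end{align*}
with $C = 19$ corruption types in our setting and the result reported in percent. The sample count stated above evaluates to $50{,}000 \times 5 \times 19 = 4{,}750{,}000$ images per $mCE$ evaluation.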

In Table~\ref{table:performance}, we report the performance with ResNet-50 \citep{Resnet} as the target network for the pipeline. The network is trained in two different ways: standard and Augmix \citep{DataAugmentation_augmix}. Performance for each train-time augmentation is presented. For comparison, we select commonly used TTA methods (described in detail in Appendix A). These conventional methods are widely used and have been shown to improve accuracy in many cases \citep{Alexnet,GoogleNet,VGG,Resnet}. Additionally, we compare our method to the previous method \citep{TTA-Policy_L2T}, which makes a single transformation prediction for each image. For each test case, the relative cost is presented. This cost only accounts for the computational load of classification, since the load of the transformations and the loss prediction is relatively negligible. For the integration of multiple predictions into a single final prediction, we compare conventional average integration to our EWM method. Performance on the clean and corrupted data is labeled as Clean and $mCE$, respectively. Smaller values indicate better performance.

\begin{table*}[ht!]
\centering
\rule{126mm}{1pt}\\%
\def\arraystretch{1.3}
\begin{tabular}{ccccccc}

\multirow{2}{*}{Train Time Augmentation}        & \multirow{2}{*}{TTA Method}                         & \multirow{2}{*}{Cost}   & \multicolumn{2}{c}{Average}                 & \multicolumn{2}{c}{EWM (Ours)} \\ \cline{4-7} 
                                                &                                                     &                         & Clean & \emph{mCE}                                 & Clean     & \emph{mCE}                \\ \hline
\multicolumn{1}{c|}{\multirow{10}{*}{Standard}} & \multicolumn{1}{c|}{Center Crop}                    & \multicolumn{1}{c|}{1}  & 24.14 & \multicolumn{1}{c|}{75.79}          &           &                    \\
\multicolumn{1}{c|}{}                           & \multicolumn{1}{c|}{Horizontal Flip}                & \multicolumn{1}{c|}{2}  & 23.76 & \multicolumn{1}{c|}{74.77}          & 23.78     & 74.75              \\
\multicolumn{1}{c|}{}                           & \multicolumn{1}{c|}{5 Crops}                        & \multicolumn{1}{c|}{5}  & 23.57 & \multicolumn{1}{c|}{74.37}          & 23.47     & 74.22              \\
\multicolumn{1}{c|}{}                           & \multicolumn{1}{c|}{10 Crops}                       & \multicolumn{1}{c|}{10} & 23.04 & \multicolumn{1}{c|}{73.57}          & 23.05     & \textbf{73.34}              \\ \cline{2-7} 
\multicolumn{1}{c|}{}                           & \multicolumn{1}{c|}{\multirow{3}{*}{Single}}        & \multicolumn{1}{c|}{1}  & 24.15 & \multicolumn{1}{c|}{74.14}          &           &                    \\
\multicolumn{1}{c|}{}                           & \multicolumn{1}{c|}{}                               & \multicolumn{1}{c|}{2}  & 24.04 & \multicolumn{1}{c|}{73.36}          & 24.03     & 73.26              \\
\multicolumn{1}{c|}{}                           & \multicolumn{1}{c|}{}                               & \multicolumn{1}{c|}{3}  & 23.84 & \multicolumn{1}{c|}{73.23}          & 23.85     & 73.08              \\ \cline{2-7} 
\multicolumn{1}{c|}{}                           & \multicolumn{1}{c|}{\multirow{3}{*}{Cyclic (Ours)}} & \multicolumn{1}{c|}{1}  & 24.15 & \multicolumn{1}{c|}{\textbf{73.69}} &           &                    \\
\multicolumn{1}{c|}{}                           & \multicolumn{1}{c|}{}                               & \multicolumn{1}{c|}{2}  & 24.04 & \multicolumn{1}{c|}{\textbf{73.13}}          & 24.06     & 73.08              \\
\multicolumn{1}{c|}{}                           & \multicolumn{1}{c|}{}                               & \multicolumn{1}{c|}{3}  & 23.81 & \multicolumn{1}{c|}{\textbf{72.74}}          & 23.81     & 72.68              \\ \hline
\multicolumn{1}{c|}{\multirow{10}{*}{Augmix}}   & \multicolumn{1}{c|}{Center Crop}                    & \multicolumn{1}{c|}{1}  & 22.39 & \multicolumn{1}{c|}{65.07}          &           &                    \\
\multicolumn{1}{c|}{}                           & \multicolumn{1}{c|}{Horizontal Flip}                & \multicolumn{1}{c|}{2}  & 22.15 & \multicolumn{1}{c|}{64.35}          & 22.16     & 64.31              \\
\multicolumn{1}{c|}{}                           & \multicolumn{1}{c|}{5 Crops}                        & \multicolumn{1}{c|}{5}  & 21.69 & \multicolumn{1}{c|}{63.56}          & 21.68     & \textbf{63.35}              \\
\multicolumn{1}{c|}{}                           & \multicolumn{1}{c|}{10 Crops}                       & \multicolumn{1}{c|}{10} & 21.56 & \multicolumn{1}{c|}{63.05}          & 21.49     & \textbf{62.76}     \\ \cline{2-7} 
\multicolumn{1}{c|}{}                           & \multicolumn{1}{c|}{\multirow{3}{*}{Single}}        & \multicolumn{1}{c|}{1}  & 22.37 & \multicolumn{1}{c|}{64.34}          &           &                    \\
\multicolumn{1}{c|}{}                           & \multicolumn{1}{c|}{}                               & \multicolumn{1}{c|}{2}  & 22.31 & \multicolumn{1}{c|}{63.82}          & 22.30     & 63.77              \\
\multicolumn{1}{c|}{}                           & \multicolumn{1}{c|}{}                               & \multicolumn{1}{c|}{3}  & 22.33 & \multicolumn{1}{c|}{63.86}          & 22.34     & 63.73              \\ \cline{2-7} 
\multicolumn{1}{c|}{}                           & \multicolumn{1}{c|}{\multirow{3}{*}{Cyclic (Ours)}} & \multicolumn{1}{c|}{1}  & 22.37 & \multicolumn{1}{c|}{\textbf{64.14}}          &           &                    \\
\multicolumn{1}{c|}{}                           & \multicolumn{1}{c|}{}                               & \multicolumn{1}{c|}{2}  & 22.33 & \multicolumn{1}{c|}{63.77}          & 22.31     & 63.74              \\
\multicolumn{1}{c|}{}                           & \multicolumn{1}{c|}{}                               & \multicolumn{1}{c|}{3}  & 22.33 & \multicolumn{1}{c|}{63.68}          & 22.31     & 63.62              \\ 
\end{tabular}
\rule{126mm}{1pt}%

\caption{\label{table:performance} Performance comparison of the previous methods with the proposed method on ImageNet and ImageNet-C. The Average columns report results when the predictions from multiple augmentations are integrated by averaging, and the EWM columns report results when they are integrated with EWM. Single refers to the previous method \citep{TTA-Policy_L2T}, and Cyclic refers to our method. Values are in bold when either the cyclic method or the EWM method improves performance by 0.2\% or more.}
\end{table*}

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\subsection{Evaluation} % quatitative evaluation...

From the results on clean ImageNet data, we observe that the loss prediction pipeline has a small and inconsistent impact on the error rate. In most cases, the loss predictor selects the identity transformation, indicating that the data are already in good condition for prediction. For the corrupted data, the single prediction reduces the $mCE$, and cyclic TTA contributes a further improvement. The target network trained with Augmix has already built strong robustness to corruption. In this case, both loss prediction pipelines show only minor improvement. % we can also see that cyclic cost of 3 becomes less improvement. this indicates that images become similar. should be able to present these to qualitative analysis???


% high severity based problem --> better!!

For EWM, while the clean data are again only slightly affected, we observe a general improvement on the corrupted data. For conventional TTA with EWM, the improvement in $mCE$ grows as the number of augmentations increases from horizontal flip to 10 crops. This indicates that, as more augmentations are used, more candidates are available for reflecting the uncertainty, which helps extract confident and correct answers.
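Concretely, the reduction in $mCE$ from average integration to EWM in Table~\ref{table:performance}, for horizontal flip, 5 crops, and 10 crops respectively, is
\begin{align*}
    \text{Standard:} &\quad 0.02\;,\; 0.15\;,\; 0.23\;, \\
    \text{Augmix:} &\quad 0.04\;,\; 0.21\;,\; 0.29\;.
\end{align*}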

% plotting should be done? for illustration??

%----------------------------------------------------------------------------------------------------------------------

% qualitative? evaluation... If CIFAR is not available???

\subsection{Cyclic Usage Of Oracle-TTA}
\label{sec:Cyclic Oracle-TTA}
Kim et al. \citep{TTA-Policy_L2T} proposed a hypothetically perfect loss predictor named Oracle-TTA to simulate the performance upper bound of the loss prediction pipeline. Oracle-TTA is assumed to accurately predict which transformation of the input image results in the smallest loss value. The hypothetical performance with Oracle-TTA indicates the potential of the pipeline. For comparison, we suggest that the cyclic usage of Oracle-TTA can further raise the upper bound, because the flexibility in the transformations offers the target network more candidate conditions of the input image. In Appendix D, we compare the upper bound of cyclic TTA to that of our baseline. The results show that, with a well-trained loss predictor, more room for improvement is available as the maximum number of transformations for cyclic TTA increases.
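One way to formalize this, in notation introduced here only for illustration and not necessarily matching the exact procedure used in Appendix D, is to let $\mathcal{T}$ be the set of candidate transformations (including the identity), $f$ the target network, $\ell$ its loss, and $y$ the ground-truth label of the input $x$. Oracle-TTA then selects
\begin{align*}
    t^{*} = \operatorname*{arg\,min}_{t \in \mathcal{T}} \ell\big(f(t(x)), y\big)\;,
\end{align*}
and its cyclic usage repeats the selection on the transformed image, starting from $x_{0} = x$:
\begin{align*}
    t^{*}_{k} &= \operatorname*{arg\,min}_{t \in \mathcal{T}} \ell\big(f(t(x_{k})), y\big)\;, \\
    x_{k+1} &= t^{*}_{k}(x_{k})\;,
\end{align*}
for $k = 0, \dots, K-1$, where $K$ is the maximum number of transformations, stopping early once $t^{*}_{k}$ is the identity. Under these assumptions, since the identity is always a candidate, the oracle loss cannot increase from one iteration to the next.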


\subsection{Discussion}
\label{sec:Discussion}
In Appendix B, we provide a visual comparison of image conditions between the center crop, single iteration, and cyclic iteration methods. We observed that corrupted images can restore some of their features and become closer to their clean condition through multiple iterations of transformations. Additionally, even clean images with ambiguity tend to restore their features, becoming more similar to other images of the same class. In our experiments, cyclic TTA was more effective than our baseline on images with higher corruption severity and less effective on images with lower corruption severity. This is because data with high severity require more transformations to restore their features. Moreover, the weaker performance at lower severity indicates that the current cyclic TTA does not reliably stop the iteration at the right time. Without a limit on the number of cyclic iterations (the maximum number of iterations), the image is sometimes nearly destroyed, losing many of its features. This indicates that if the loss prediction pipeline is not functioning perfectly, it can introduce further unwanted corruption. These observations indicate that the maximum number of iterations should be set in proportion to the quality of the loss predictor, and that additional exit signals are required to prevent unwanted corruption. % maybe put some graphical figure here
%Thus, we can infer that additional exit signals other than choosing the identity transformation and limiting the maximum number of iteration should be required, unless the loss predictor achieves much better accuracy. 


From our experiments, we have observed that a well-functioning loss predictor contributes to even better performance in the cyclic TTA pipeline than in our baseline. On the other hand, a poorly performing loss predictor degrades the cyclic TTA performance even further, which means that the accuracy of the loss predictor is reflected more drastically in the performance under cyclic use. From this observation, the key factor in approaching the cyclic Oracle-TTA performance is training a loss predictor with high accuracy, which involves finding suitable transformation candidates that are well learnable by the loss predictor and finding a suitable training configuration for the loss predictor. %training method learnable cyclic continuity of the loss predictor usage relates to  % In addition to such observation, we have analyzed the number of transformations used for the cyclic TTA for each type of corruption, finding that % we see that certain transformations are not well learnable? by comparing the results to the oracle

% 

From the difference in EWM performance between the clean and corrupted conditions, we suggest that the measured data uncertainty is more evident in the corrupted condition for the given target network. Considering that data uncertainty should be extracted from multiple target network predictions \citep{Uncertainty_Ensemble}, it is possible that the calculated entropy does not reflect the data uncertainty accurately. Examining the entropy values in the ``10 crops'' case, we found that clean data produced relatively uniform entropy values across the augmentations, whereas corrupted data often produced outlying entropy values, which correspond to uncertain augmentations. This indicates that, while reflecting data uncertainty in this way can be effective when the input image is evidently distorted, a more precise and accurate measurement of the uncertainty is required to gain an advantage on clean data.



%------------------------------------------------------------------------

\section{Conclusion}
\label{sec:Conclusion}

% In this work, we have introduced the cyclic modification of the loss prediction pipeline to implement flexible transformations to the input image and implementation of EWM for TTA policy. We state that iterations to find a suitable condition of a corrupted image can be considered as part of iterative optimization process, and able to restore part of its original quality for network prediction. The loss predictor learns the implicit features of the corrupted condition of the image to to predict the most suitable transformation to result in better performance for the target network. Our main contribution is to suggest cyclic loss prediction pipeline to expand the transformation space of the input image and the upper bound of the loss prediction pipeline via achieving the flexibility of the transformations.

In this work, we have introduced a cyclic modification of the loss prediction pipeline that applies flexible transformations to the input image, together with the use of EWM for the TTA policy. Given that the loss predictor learns the implicit features of the corrupted condition of the image to predict the most suitable transformation, we argue that the multiple iterations used to find a suitable condition of the corrupted image can be considered part of an iterative optimization process that is able to restore part of the original image quality for network prediction. Our main contribution is to suggest that the cyclic loss prediction pipeline can expand the transformation space of the input image and raise the upper bound of the loss prediction pipeline through the flexibility of the transformations.


For EWM, we show that directly reflecting data uncertainty can be effective on corrupted data. When multiple augmentations are given, each can contribute with a different weight, because their importance for the network prediction differs.

% meaningful in that, we suggest multiple trans with flexible magnitude. Assuming that the loss predictor can be trained perfectly, could be awesome!

Although we have suggested that such a pipeline holds much potential for performance improvement, a large gap remains from the ideal Oracle-TTA performance. Therefore, our future work will focus on configuring and training a loss predictor with high performance. As for the transformation candidates, although we used a set of predefined transformations similar to that of our baseline, it is possible that more transformations with a wider range of magnitudes are better suited to finding a better condition of the input image. Thus, we plan to experiment with generative models to restore the corrupted condition with respect to the target network. In future work, we expect to perform the transformations without a predefined set.

\begin{acknowledgements} % will be removed in pdf for initial submission,
                         % so you can already fill it to test with the
                         % ‘accepted’ class option

    This work was conducted by the Center for Applied Research in Artificial
Intelligence (CARAI) grant funded by the Defense Acquisition Program Administration (DAPA) and
the Agency for Defense Development (ADD) (UD190031RD).
\end{acknowledgements}

\bibliography{chun_45}



\end{document}
