\documentclass[accepted, table]{uai2023} % for initial submission
% \documentclass[accepted]{uai2023} % after acceptance, for a revised
                                    % version; also before submission to
                                    % see how the non-anonymous paper
                                    % would look like
%% There is a class option to choose the math font
% \documentclass[mathfont=ptmx]{uai2023} % ptmx math instead of Computer
                                         % Modern (has noticable issues)
% \documentclass[mathfont=newtx]{uai2023} % newtx fonts (improves upon
                                          % ptmx; less tested, no support)
% NOTE: Only keep *one* line above as appropriate, as it will be replaced
%       automatically for papers to be published. Do not make any other
%       change above this note for an accepted version.

%% Choose your variant of English; be consistent
\usepackage[american]{babel}
% \usepackage[british]{babel}

%% Some suggested packages, as needed:
\usepackage{natbib} % has a nice set of citation styles and commands
    \bibliographystyle{plainnat}
    \renewcommand{\bibsection}{\subsubsection*{References}}
\usepackage{mathtools} % amsmath with fixes and additions
% \usepackage{siunitx} % for proper typesetting of numbers and units
\usepackage{booktabs} % commands to create good-looking tables
\usepackage{tikz} % nice language for creating drawings and diagrams

%% Provided macros
% \smaller: Because the class footnote size is essentially LaTeX's \small,
%           redefining \footnotesize, we provide the original \footnotesize
%           using this macro.
%           (Use only sparingly, e.g., in drawings, as it is quite small.)

%% Self-defined macros
\newcommand{\swap}[3][-]{#3#1#2} % just an example


\usepackage{graphicx}
\usepackage{amsmath}
\usepackage{amssymb}
\usepackage{booktabs}

%%%%%%% packages
% \usepackage{algorithm, algpseudocode}
\usepackage{algorithmic}
\usepackage{adjustbox}
\usepackage[ruled,vlined]{algorithm2e}
\usepackage{times}
\usepackage{multirow}
\usepackage{epsfig}
\usepackage{chngpage}
\usepackage{mathtools}
\usepackage{caption}
\captionsetup[table]{skip=10pt}
% \usepackage[table,xcdraw]{xcolor}
% \usepackage[table]{xcolor}
%\usepackage[section]{placeins}
\usepackage{listings}
\usepackage{subcaption}
\usepackage{sidecap}

\usepackage[T1]{fontenc}
\usepackage[utf8]{inputenc}
% \usepackage{xcolor}
\definecolor{textblue}{rgb}{.2,.2,.7}
\definecolor{textred}{rgb}{0.54,0,0}
\definecolor{textgreen}{rgb}{0,0.43,0}
% \usepackage{listings}
\lstset{language=Python, 
numbers=left, 
numberstyle=\tiny, 
stepnumber=1,
tabsize=4,
basicstyle=\fontsize{9}{11}\selectfont\ttfamily,
numbersep=5pt, 
keywordstyle=\color{textblue},
commentstyle=\color{textred},   
stringstyle=\color{textgreen},
frame=none,                    
columns=fullflexible,
keepspaces=true,
xleftmargin=\parindent,
showstringspaces=false}

\title{ViBid: Linear Vision Transformer with Bidirectional Normalization}

% The standard author block has changed for UAI 2023 to provide
% more space for long author lists and allow for complex affiliations
%
% All author information is authomatically removed by the class for the
% anonymous submission version of your paper, so you can already add your
% information below.
%
% Add authors
\author[1,*]{Jeonggeun Song}
\author[1,*]{\href{mailto:<andrew.com@kakaoenterprise.com>?Subject=ViBid}{Heung-Chang Lee}{}}
% Add affiliations after the authors
\affil[1]{%
    AI Lab \& Service\\
    Kakao Enterprise\\
    Seongnam-si, South Korea
}
\affil[*]{%
    Equal Contributions
}
  
  \begin{document}
\maketitle

%%%%%%%%% ABSTRACT
\begin{abstract}
  The vision transformer has achieved state-of-the-art performance in various vision tasks; however, its memory consumption is larger than that of previous convolutional neural network based models because of the $O(N^2)$ time and memory complexity of general self-attention models. Many approaches aim to reduce the complexity to $O(N)$ to solve this problem; however, to compensate for the resulting performance degradation, they stack deep convolutional layers to retain locality or complicate the architecture, as seen in window attention. To solve these problems, we propose the ViBid algorithm, which resolves the $O(N^2)$ complexity problem by replacing Softmax with bidirectional normalization (BiNorm). In addition, it has a much simpler architecture than existing transformer models with $O(N)$ complexity. Owing to our simple architecture, we were able to use larger resolutions for training, and we obtained a lighter model with superior GPU throughput and competitive performance. ViBid can be used with any transformer method that uses queries, keys, and values ($QKV$) because of BiNorm, and it is quite universal due to its simple architectural structure.
\end{abstract}

%%%%%%%%% BODY TEXT
\section{Introduction}
\label{introduction}
Transformers have been used in various fields. Initially, they were mostly employed in natural language processing (NLP)~\citep{dosovitskiy2020image, touvron2020training, wu2021cvt, srinivas2021bottleneck, heo2021rethinking, graham2021levit, el2021xcit}, but currently, transformers are used in many domains of vision \citep{dosovitskiy2020image, touvron2020training, jiang2021transgan, esser2021taming, durall2021combining}. The transformer has achieved state-of-the-art performance on several benchmark datasets. In the early stages, the vision transformer splits the input into patch units and then learns the image features after securing the locality with the convolutional layer. In this process, general transformer models which have an $O(N^2)$ complexity, for the number of tokens $N$, prohibit the size of the model parameters from growing excessively by using deeper convolutional layers or decreasing the size of the input sent to the transformer.


This is because the size of the model grows with the square of the input size, that is, the number of tokens. Our proposed algorithm dramatically reduces the complexity from $O(N^2)$ to $O(N)$ by replacing Softmax, which is the most commonly used function, with bidirectional normalization (BiNorm) and by changing the multiplication order of the query, key, and value ($QKV$). This allows us to stack the transformer module deeper and use a higher resolution as the input
because the number of model parameters does not increase with the token size. Because $L_2$-normalization operates in distinct directions, on the channel axis of $Q$ and the spatial axis of $K^TV$, we call it bidirectional normalization.


One of the most essential aspects of the proposed method is that it has the simplest architecture among $O(N)$ complexity transformer algorithms. In contrast, existing methods for reducing complexity have resulted in performance degradation. To compensate, additional modules were added to the models, resulting in a complicated architecture, as seen in Figure~\hyperref[fig:comp_att_wa]{1(b)}, \hyperref[fig:comp_att_pa]{1(c)}, and \hyperref[fig:comp_att_ka]{1(d)}. ViBid, however, is a linear transformer with $O(N)$ complexity and an extremely simple architecture that does not require additional modules to supplement performance. Consequently, our suggested approach can be used with any transformer algorithm that has $QKV$ and for any vision-related task.

\begin{figure*}[t!]
\adjustbox{valign=t}{
\begin{minipage}[t]{0.65\textwidth}
    \begin{subfigure}[b]{0.32026144\textwidth}
      \centerline{\includegraphics[width=\linewidth]{figures/overall_workflow/self_attention.pdf}}
      \caption{Self-attention.}
      \label{fig:comp_att_sa}
    \end{subfigure}
    \begin{subfigure}[b]{0.30065359\textwidth}
      \centerline{\includegraphics[width=\linewidth]{figures/overall_workflow/window_attention.pdf}}
      \caption{Window attention.}
      \label{fig:comp_att_wa}
    \end{subfigure}
    \begin{subfigure}[b]{0.29738562\textwidth}
      \centerline{\includegraphics[width=\linewidth]{figures/overall_workflow/pattern_attention.pdf}}
      \caption{Pattern-based attention.}
      \label{fig:comp_att_pa}
    \end{subfigure}\\
    \\
    \begin{subfigure}[b]{0.42892157\textwidth}
      \centerline{\includegraphics[width=\linewidth]{figures/overall_workflow/kernel_attention.pdf}}
      \caption{Kernel-based attention.}
      \label{fig:comp_att_ka}
    \end{subfigure}\hspace{0.03\textwidth}
    \begin{subfigure}[b]{0.46650327\textwidth}
      \centerline{\includegraphics[width=\linewidth]{figures/overall_workflow/binorm_attention.pdf}}
      \caption{BiNorm-based attention. (Ours)}
      \label{fig:comp_att_binorm}
    \end{subfigure}
\end{minipage}
}
\adjustbox{valign=t}{
\begin{minipage}[t]{0.35\textwidth}
\caption{\textbf{Comparison with existing self-attention algorithms:}
(a) \textbf{Self-attention:} The query and key generate $N\times N$ attention maps, resulting in an $O(N^2)$ operation.
(b) \textbf{Window attention:} This method splits input images into several windows before computing self-attention. While it avoids the $O(N^2)$ problem of self-attention, it also slows down computation due to a fixed window size.
(c) (d) \textbf{Efficient self-attention:} These methods choose a fixed number of tokens to generate attention maps. However, these procedures are heuristic and complicated, and some of them require operations not supported by common frameworks.
(e) \textbf{Ours:} BiNorm-based self-attention is not as complicated and has an $O(N)$ complexity since it does not require any additional processing.}
\end{minipage}
}
\label{fig:comparison_attention}
\end{figure*}


The contributions of our algorithm can be summarized as follows.
\begin{itemize}
    \item Our proposed algorithm facilitates the building of an efficient architecture, even when the token size (resolution) increases, by improving the complexity from $O(N^2)$ to $O(N)$.
    \item The sequence in which $QKV$ is multiplied varies even among $O(N)$ models, and our proposed model has a simpler architectural structure than other models with the same $O(N)$ complexity; therefore, it is more parameter-efficient.
    \item When the resolution (number of tokens) must be large, most ViT designs reduce the final input to the transformer by deeply stacking convolutional layers in the early stage. However, because our algorithm is not burdened by large inputs, it can be trained with a transformer without significantly reducing the input size in the early convolutional stage.
\end{itemize}

\section{Related Works}
\label{related_works}
\paragraph{Vision transformers.} Dosovitskiy et al.~\citep{dosovitskiy2020image} proposed a
vision transformer (ViT), which demonstrated the use of transformer-based models for vision tasks. After the achievements of ViT, DeiT~\citep{touvron2020training} introduced data-efficient training strategies for vision transformers with detailed ablation studies. They solved the ViT data efficiency problem successfully, and most of the current transformer-based models follow their schemes.

In further research, various architectures based on transformer variants have been presented. Touvron et al.~\citep{touvron2021going} proposed two simple types of modules. One is the class attention module, which is the additional SA layer used to extract class information. These layers help the model aggregate features from the last outputs. The other is the LayerScale modules. These are learnable parameters for scaling residual connections. This prevents larger models from being overfitted. A simple variant of the LayerScale was presented at ResMLP~\citep{touvron2021resmlp}. 
While MLP-based models are irrelevant to our model, we apply Affine modules to our model as scalers.

Liu et al.~\citep{liu2021swin} proposed shifted windows and patch merging. This method generates local attention using two types of windows: normal windows and shifted windows. At the end of each stage, it merges the patches to preserve large receptive fields without heavy computation. The Swin Transformer is organized in a hierarchical structure: it reduces the feature resolution as the layers deepen, similar to how a CNN reduces the resolution of the input image as the layers deepen. Because its features carry information at varied scales, the Feature Pyramid Network (FPN) structure used in object detection can be employed. Its performance in object detection and segmentation tasks is consistently superior to that of ViT because it takes advantage of multi-scale information via the FPN structure.

\paragraph{Hybrid architectures.} Various methods for integrating convolutional layers~\citep{heo2021rethinking, wang2021pyramid, graham2021levit, el2021xcit, xiao2021early, hassani2021escaping} instead of searching for new spatial structures have been introduced. LeViT, designed by Graham et al.~\citep{graham2021levit}, applies multi-stage networks to transformers using SA with convolution and pooling layers. Xiao et al.~\citep{xiao2021early} found that replacing linear patch embedding layers with convolutions helps transformers better capture low-level features. This is very similar to the stemming stage of existing CNN networks. El-Nouby et al. introduced local patch interactions in XCiT~\citep{el2021xcit}. With two depthwise convolutions~\citep{chollet2017xception} added after XCA, XCiT achieved better performance. Our models are generally inspired by the intrinsic optimization strategies that XCiT introduced, while we present our own SA method.


\paragraph{Efficient self-attention.} 
Instead of architectural strategies, several approaches have been proposed to solve the $O(N^2)$ problem of the self-attention (SA) mechanism. They are classified into several categories: those that use their own spatial patterns~\citep{ho2019axial, child2019generating, sukhbaatar2019adaptive}, linear approximation by sampling important tokens~\citep{kitaev2020reformer, xiong2021nystr}, various low-rank factorization methods~\citep{choromanski2020rethinking, shen2021efficient, wang2020linformer}, and local attention~\citep{liu2021swin}.


However, these approaches have issues beyond complexity. The pattern and sampling methods are difficult to implement, and their GPU efficiency is low because a dynamic graph has to be created each time. The low-rank factorization methods rely on human heuristics, because the kernel function must be chosen empirically. Local attention has a complex architecture; for example, the Swin Transformer~\citep{liu2021swin} must split its input into windows before computing attention.


\section{Method}\label{method}
For the SA algorithm of the transformer~\citep{vaswani2017attention}, the query ($Q$) and key ($K$) are multiplied first to compute every pairwise relation of the tokens. This multiplication has time and memory complexity that is quadratic in the number of tokens. If the matrix multiplication of $K$ and $V$ is computed first, the computational cost of the SA is reduced to $O(N)$. However, the Softmax function must be applied to the key-query interactions to generate a probability distribution for the attention mechanism. Softmax is a nonlinear operation; therefore, it must be removed from the SA to change the order of matrix multiplication.
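
To make the saving concrete, the two multiplication orders can be compared by counting multiply-accumulate operations. The count below is a sketch that ignores the Softmax, the normalization, and the linear projections, and writes $d_{q,k}$ and $d_{v}$ for the channel sizes of $Q$, $K$ and of $V$:
\begin{align*}
    (QK^{T})V &: \underbrace{N\cdot d_{q,k}\cdot N}_{QK^{T}} + \underbrace{N\cdot N\cdot d_{v}}_{(QK^{T})V} = O\big(N^{2}(d_{q,k}+d_{v})\big),\\
    Q(K^{T}V) &: \underbrace{d_{q,k}\cdot N\cdot d_{v}}_{K^{T}V} + \underbrace{N\cdot d_{q,k}\cdot d_{v}}_{Q(K^{T}V)} = O\big(N\, d_{q,k}\, d_{v}\big).
\end{align*}
The intermediate result shrinks in the same way: $QK^{T}$ has $N\times N$ entries, whereas $K^{T}V$ has only $d_{q,k}\times d_{v}$ entries, independent of $N$.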


To determine the effect of removing Softmax, we experimented with Softmax-free ViT models on the ImageNet1k classification task. That is, these models did not employ a probabilistic formulation of SA. Interestingly, removing Softmax had no effect on the performance of the ViT models, as hypothesized. However, without Softmax, the models trained slowly and unstably during the early epochs of training. Furthermore, when additional experiments were conducted with other architectural optimizations, such as convolutional modules, the performance decreased. We propose the BiNorm method and integrate it with several existing architectural strategies to address these issues. This section explains how our proposed strategy generates stable Softmax-free SA models while avoiding quadratic complexity.


\begin{figure}[t]
\centering
\includegraphics[width=0.73\linewidth]{figures/vibid_model.pdf}
\caption{\textbf{ViBid model.} Our proposed ViBid model consists of a BiNorm attention module, two $3\times 3$ separable convolution layers, and a feedforward layer. Note that LayerNorm and Affine layers are omitted for simplicity.}
\label{fig:vibid_module}
\end{figure}


\begin{figure*}[t]
\adjustbox{valign=t}{
\begin{minipage}[t]{0.65\textwidth}
  \begin{subfigure}[b]{0.3\linewidth}
    \centering
    \centerline{\includegraphics[width=\linewidth]{figures/mockup_attn/sa_attn_map.pdf}}
    \caption{Normal SA.}
    \label{fig:sa_attn_map}
  \end{subfigure}
  \begin{subfigure}[b]{0.3\linewidth}
    \centering
    \centerline{\includegraphics[width=\linewidth]{figures/mockup_attn/naive_attn_map.pdf}}
    \caption{Softmax-free SA.}
    \label{fig:softmax_free_attn_map}
  \end{subfigure}
  \begin{subfigure}[b]{0.3\linewidth}
    \centering
    \centerline{\includegraphics[width=\linewidth]{figures/mockup_attn/binorm_attn_map.pdf}}
    \caption{BiNorm-based SA.}
    \label{fig:binorm_attn_map}
  \end{subfigure}
\end{minipage}
}
\adjustbox{valign=t}{
\begin{minipage}[t]{0.32\textwidth}
  \caption{\textbf{Attention maps in the early epochs.} (a) \textbf{Normal SA.} (b) \textbf{Softmax-free SA.} If the softmax is eliminated without replacement, it is heavily biased to specific patches. (c) \textbf{BiNorm-based SA.} BiNorm makes the output vectors unit-sized to debias the attention maps.}
\end{minipage}
}
\label{fig:attn_maps}
\end{figure*}
% SA: O(N^2d) / Window: O(k^2Nd), k=window_size / Pattern: O(kNd), k=sampling_size / Kernel: O(kNd), k=whole_kernel_size / BiNorm: O(kNd), k=num_qk_channels, d=num_value_channels


\subsection{BiNorm}
\label{sec:BiNorm}
The pixel-to-pixel relationship, which basic SA methods obtain as $\text{Softmax}(QK^T)$, is calculated as a dot product. Softmax restricts the output values to the range from 0 to 1 to obtain an attention map, as shown in Figure~\hyperref[fig:sa_attn_map]{2(a)}. When Softmax is removed from the $QKV$ computation, the output follows a normal distribution at initialization. However, a normal distribution has an unbounded range, $(-\infty, \infty)$. Without Softmax, this makes the initial values more biased toward specific pixels, as shown in Figure~\hyperref[fig:softmax_free_attn_map]{2(b)}.


Considering the previous insight, we conclude that the primary role of the Softmax function is not to construct a probability distribution but to limit the output range of the function. Consequently, we determined that the Softmax function is not essential in ViT. We propose BiNorm, a combination of two $L_2$-normalizations applied bidirectionally to the spatial dimension of $K^{T}V$ and the channel dimension of $Q$. It gives the SA a complexity of $O(N)$ with a few lines of modification. Let $\textbf{x}\in \mathbb{R}^{b\times N \times d}$ be the input image, where $b$ is the batch size, $N$ is the number of tokens, and $d$ is the number of channels. Then, BiNorm-based SA is defined as:
\begin{align}\label{eq:BiNorm}
    &Q=W_{Q}\textbf{x}, K=W_{K}\textbf{x}, V=W_{V}\textbf{x}\\
    &(Q\in \mathbb{R}^{N\times d_{q,k}}, K\in \mathbb{R}^{N\times d_{q,k}}, V\in \mathbb{R}^{N\times d_{v}})\\
    &\text{SA}(\textbf{x})=\text{BiNorm}(Q, K^{T}V)\\
    &(Q\in \mathbb{R}^{N\times d_{q,k}}, K^{T}V\in \mathbb{R}^{d_{q,k}\times d_{v}})\\
    &\text{BiNorm}(A, B)=[L_{2}(A)_{\text{dim}=2}]^{T}L_{2}(B)_{\text{dim}=1}
\end{align}

for arbitrary matrices $A\in\mathbb{R}^{b\times N\times d}$ and $B\in\mathbb{R}^{b\times N\times d}$. BiNorm consists of two simple $L_2$-normalizations that apply to the channel dimension of $Q$ and the spatial dimension of $K^{T}V$.
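
As an illustration of Equation~\ref{eq:BiNorm}, the following single-head, unbatched sketch uses only the Python standard library. It is not the authors' implementation. It assumes one reading of the definition: each row of $Q$ (channel axis) and each column of $K^{T}V$ (the $d_{q,k}$ axis) is scaled to unit $L_2$ norm, and the two normalized matrices are multiplied as $N\times d_{q,k}$ by $d_{q,k}\times d_{v}$. The names \texttt{matmul}, \texttt{transpose}, \texttt{l2\_normalize\_rows}, and \texttt{binorm\_attention} are hypothetical helpers introduced for this example.
\begin{lstlisting}[language=Python,breaklines=true]
import math
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col))
             for col in zip(*b)] for row in a]


def transpose(a):
    """Swap the rows and columns of a matrix."""
    return [list(col) for col in zip(*a)]


def l2_normalize_rows(a, eps=1e-12):
    """Scale every row to unit L2 norm (eps avoids division by zero)."""
    out = []
    for row in a:
        norm = math.sqrt(sum(x * x for x in row))
        out.append([x / max(norm, eps) for x in row])
    return out


def binorm_attention(q, k, v):
    """q, k: N x d_qk, v: N x d_v. Returns an N x d_v matrix."""
    # K^T V is d_qk x d_v; no N x N matrix is ever formed.
    kv = matmul(transpose(k), v)
    # Assumed reading: unit-norm rows of Q ...
    q_hat = l2_normalize_rows(q)
    # ... and unit-norm columns of K^T V.
    kv_hat = transpose(l2_normalize_rows(transpose(kv)))
    return matmul(q_hat, kv_hat)


random.seed(0)
N, d_qk, d_v = 6, 4, 3  # toy sizes, chosen arbitrarily
q = [[random.gauss(0, 1) for _ in range(d_qk)] for _ in range(N)]
k = [[random.gauss(0, 1) for _ in range(d_qk)] for _ in range(N)]
v = [[random.gauss(0, 1) for _ in range(d_v)] for _ in range(N)]
out = binorm_attention(q, k, v)
\end{lstlisting}
Under this reading, every entry of \texttt{out} is the dot product of two unit vectors and therefore lies in $[-1, 1]$, which matches the bounded output range discussed in this section.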


The output vectors of BiNorm-based attention are limited to unit size, so all vectors have the same weight during the attention mechanism. Therefore, because $Q$ and $K$ are normalized to unit vectors by $L_2$-normalization, BiNorm generates a smoothed attention map that differs from that of the Softmax-free model, as shown in Figure~\hyperref[fig:binorm_attn_map]{2(c)}. Mathematically, it is a cosine similarity matrix of $Q$ and $K^{T}V$ that generates clearer relations. Empirically, the ViT models with BiNorm converged faster than those without BiNorm. Additionally, the performance of BiNorm-based models did not decrease when other architectural optimizations were added to the models.


\subsection{Comparison of Computational Complexity}
The original SA has a complexity of $O(N^2)$ when computing $QK^T$.
\begin{align}\label{eq:sa}
    &Q=W_{Q}\textbf{x}, K=W_{K}\textbf{x}, V=W_{V}\textbf{x}\\
    &(Q\in \mathbb{R}^{N\times d_{q,k}}, K\in \mathbb{R}^{N\times d_{q,k}}, V\in \mathbb{R}^{N\times d_{v}})\\
    &\text{SA}(\textbf{x})=\text{Softmax}\left(\frac{QK^T}{\sqrt{d_{q,k}}}\right)V\\
    &(QK^T\in \mathbb{R}^{N\times N}, V\in \mathbb{R}^{N\times d_{v}})
\end{align}

If the order of matrix multiplication is changed so that $K^{T}V$ is computed first, as in BiNorm, both $K^{T}V$ and $Q(K^{T}V)$ have a complexity of $O(N)$ (see Equation~\ref{eq:BiNorm} for details). For vision tasks, the number of tokens is proportional to the image resolution (height $\times$ width) and inversely proportional to the patch area. For example, if both the height and width of an input image are doubled, the number of tokens grows $4\times$ and the original SA requires $16\times$ the computational resources. This is inefficient for cases that require high-resolution inputs, such as the compound scaling method of EfficientNet~\citep{tan2019efficientnet}.
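
The scaling argument above can be checked with a few lines of arithmetic. The sketch below assumes square, non-overlapping $16\times 16$ patches and uses a hypothetical helper \texttt{num\_tokens}; it only counts tokens and does not measure actual runtime or memory.
\begin{lstlisting}[language=Python,breaklines=true]
def num_tokens(height, width, patch=16):
    """Number of non-overlapping patch tokens in an image."""
    return (height // patch) * (width // patch)


base = num_tokens(224, 224)    # 14 * 14 = 196 tokens
scaled = num_tokens(448, 448)  # 28 * 28 = 784 tokens

token_ratio = scaled / base           # tokens grow 4x
quadratic_ratio = token_ratio ** 2    # O(N^2) attention grows 16x
linear_ratio = token_ratio            # O(N) attention grows 4x
\end{lstlisting}
Doubling the side length therefore multiplies the cost of the $N\times N$ attention map by 16, while a linear-complexity attention grows only by the factor of 4 that the token count itself grows by.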


In previous studies, various methods have been proposed to give self-attention $O(N)$ complexity (see Section~\hyperref[related_works]{2} for further information). Most of these reduce the tokens that generate attention maps by utilizing local functions, learnable kernel functions, or human-designed patterns. However, they have several limitations. First, they rely heavily on human heuristics. When the entire
workflow is altered to some degree, new heuristics are required for the entire workflow. Second, they frequently require specialized operations that are not generally supported. This implies that they may be difficult to optimize for different tasks, frameworks, and devices. Finally, they complicate the overall computational flow. Many machine-learning devices have been designed for dense operations, and a complicated computational graph may result in redundancy and memory leakage when it is run on them.


BiNorm-based algorithms can reduce the complexity from $O(N^2)$ to $O(N)$ by modifying a few lines of code. As depicted in Figure~\hyperref[fig:comparison_attention]{1}, the computational graph of BiNorm-based attention is no more complicated than that of the original SA. Because of its simple structure, our module is much more efficient on GPUs than most other SA algorithms. We discuss the numerical analysis of the computational efficiency in Section~\hyperref[experiments]{4}. Our proposed method consumes the least GPU memory and has the highest GPU throughput among models with a similar scale of FLOPs and parameters.


\subsection{ViBid Model}
As shown in Figure~\ref{fig:vibid_module}, the input images are passed through convolutional patch embedding layers and divided into $16\times 16$ patches. The convolutional patch embedding layers outperform linear patch embedding layers in terms of model performance. The ViBid module consists of BiNorm-based SA, two $3\times 3$ separable convolution layers, and a feedforward module. Unlike other models, BiNorm-based attention adopts a bottleneck design: the embedding size $h$ is smaller than the output channel size $d$. In our experiments, a smaller embedding size prevented overfitting of the model. The convolutional layers, which use relatively few resources, provide locality as a weak inductive bias. We adopted the class attention layers proposed by Touvron et al.~\citep{touvron2021going} but, unlike the existing ones, our class attention layers are BiNorm-based. Table~\ref{tb:design} presents the model designs used in our experiments.


\begin{table}[t!]
\centering
\caption{\textbf{Design of ViBid models.} The architectural parameters contain the depth of model, the output dimension of each model $d$, the size of embedding $h$, and the number of heads.}
\begin{tabular}{l|c|c|c|c}
\hline
\rowcolor[HTML]{EFEFEF}
Model & depth & $d$ & $h$ & \#heads\\
\hline
ViBid-U & 12 & 192 & 96 & 4 \\
ViBid-T & 24 & 192 & 96 & 4 \\
ViBid-S & 12 & 384 & 128 & 8 \\
ViBid-M & 24 & 384 & 128 & 8 \\
ViBid-B & 24 & 512 & 128 & 8 \\
\hline
\end{tabular}
\label{tb:design}
\end{table}


\begin{table}[t]
\centering
\caption{\textbf{Results of fine-tuning at higher resolutions.} Our models show the fastest GPU throughput and the lowest peak memory in comparison with the other models that achieve similar performance. Note that the XCiT~\citep{el2021xcit} models use a $224\times 224$ resolution because they use a smaller patch size.}
\begin{tabular}{l|c|c|c|c}
\hline
\rowcolor[HTML]{EFEFEF}
 & Top-1 & FLOPs & & GPU Thr. \\
\rowcolor[HTML]{EFEFEF}
\multirow{-2}{*}{Model} & Acc. & (G) & \multirow{-2}{*}{Res.} & (img/s) \\ \hline
EfficientNet-B7~ & 84.3 & 37.0 & 600 & 53.6 \\
XCiT-S24/8~ & 83.9 & 36.0 & 224 & 106.5 \\
XCiT-M24/8~ & 83.7 & 63.9 & 224 & 69.5 \\
DeiT-B~ & 83.1 & 49.4 & 384 & 87.8 \\
Swin-B~ & 84.5 & 47.0 & 384 & 86.5  \\ \hline
\rowcolor[HTML]{EFEFEF}
ViBid-M & 83.8 & 20.5 & 384 & 171.2 \\
\rowcolor[HTML]{EFEFEF}
ViBid-B & 84.5 & 35.1 & 384 & 114.9 \\
\rowcolor[HTML]{EFEFEF}
ViBid-B & 84.7 & 62.4 & 512 & 66.6 \\
\rowcolor[HTML]{EFEFEF}
ViBid-B & 84.8 & 140.4 & 768 & 28.6 \\ \hline
\end{tabular}
\label{tb:hires_results}
\end{table}


\section{Experiments}
\label{experiments}
%%%%%%%%%%%%%%%%%%%%%%
%%%%%% Imagenet dataset
%%%%%%%%%%%%%%%%%%%%%%

\subsection{Image Classification}
\begin{figure}[t!]
\centering
    \begin{subfigure}[b]{0.49\textwidth}
      \centerline{\includegraphics[width=\linewidth]{figures/acc/flops_acc_comparison.png}}
      \caption{FLOPs vs. ImageNet Top-1 Acc.}
    \end{subfigure}
    \begin{subfigure}[b]{0.49\textwidth}
      \centerline{\includegraphics[width=\linewidth]{figures/acc/params_acc_comparison.png}}
      \caption{Params vs. ImageNet Top-1 Acc.}
    \end{subfigure}
  \caption{\textbf{Comparison with the transformer-based vision models.} Our models show superior performance in most regimes of FLOPs and parameter count. More details, including the comparison with CNN models, are given in Table~\ref{tb:imagenet_comparison}.}
  \label{fig:flops_params_acc_comparison}
\end{figure}

\paragraph{Implementation details.} For the image classification task, we evaluate our models on the ImageNet1k~\citep{deng2009imagenet} dataset, which spans 1000 semantic classes. It contains 1,281k images for training and 50k images for validation. No additional labeled or unlabeled dataset is used. We train our models for 400 epochs with the AdamW optimizer~\citep{loshchilov2017decoupled}. Following the linear scaling rule~\citep{you2017large}, the learning rate is scaled by $b/512$ for batch size $b$. It warms up linearly for the first 5 epochs before decaying with a cosine schedule. LayerNorm~\citep{ba2016layer} and Affine~\citep{touvron2021resmlp} are used in each residual block to improve generalization. As strong regularization, our proposed method utilizes RandAugment~\citep{cubuk2020randaugment}, stochastic depth~\citep{huang2016deep}, and CutMix~\citep{yun2019cutmix} for data-efficient training. The magnitude of RandAugment and the probability of dropping residual connections depend on the size of each model, because larger models require stronger regularization to train well. We do not employ knowledge distillation from a pre-trained teacher model to boost performance. All training procedures are performed on 32 NVIDIA A100 GPUs.
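
The learning-rate schedule described above can be sketched as follows. This is an illustration under stated assumptions rather than the training code: the base learning rate \texttt{base\_lr} is a placeholder value that is not given in this paper, the rate is updated once per epoch, and the cosine phase is assumed to decay to zero. The function name \texttt{learning\_rate} is hypothetical.
\begin{lstlisting}[language=Python,breaklines=true]
import math


def learning_rate(epoch, batch_size, base_lr=5e-4,
                  warmup_epochs=5, total_epochs=400):
    """Linear warm-up followed by cosine decay.

    base_lr is a placeholder, not a value from the paper.
    """
    # Linear scaling rule: scale the rate by b / 512.
    peak = base_lr * batch_size / 512
    if epoch < warmup_epochs:
        # Ramp linearly up to the peak over the warm-up epochs.
        return peak * (epoch + 1) / warmup_epochs
    # Cosine decay from the peak towards zero (assumed floor).
    progress = (epoch - warmup_epochs) / (total_epochs - warmup_epochs)
    return 0.5 * peak * (1.0 + math.cos(math.pi * progress))


schedule = [learning_rate(e, batch_size=1024) for e in range(400)]
\end{lstlisting}
With a batch size of 1024, the peak rate is twice \texttt{base\_lr}; it is reached at the end of the fifth epoch and then decreases monotonically.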

\begin{figure}[t!]
\centering
\begin{subfigure}{0.48\textwidth}
    \centering
    \includegraphics[width=\linewidth]{figures/memory/allocated_memory_resolution_comparison.png}
    \caption{Allocated memory according to the input resolution}
    \label{fig:memory_consumption}
\end{subfigure}
\hspace{0.005\textwidth}
\begin{subfigure}{0.48\textwidth}
    \centering
    \includegraphics[width=\linewidth]{figures/gpu_throughputs/gpu_throughputs_resolution_comparison.png}
    \caption{GPU throughput according to the input resolution}
    \label{fig:resolution_throughput_comparison}
\end{subfigure}
\caption{\textbf{Comparison of computational resource consumption at high resolution.} (a) Peak GPU memory measured at different resolutions. Our models require significantly less memory than the other models at all resolutions. (b) GPU throughput measured at varying resolutions. The GPU throughput axis is $\log_2$-scaled. Note that the GPU throughput of the ViBid models decreases more slowly as the resolution increases.}
\label{fig:comparison_consumption_hires}
\end{figure}

\begin{table*}[t]
\centering
\caption{\textbf{Comparison with the concurrent models.} The image classification results include the top-1 accuracy, param size, FLOPs, and GPU throughput of various models on ImageNet1k. Our models show competitive results for top-1 accuracy, and show the fastest GPU throughput among models which achieve similar performance.}
\begin{tabular}{l|c|c|c|c|c}
\hline
\rowcolor[HTML]{EFEFEF}
 & Top-1 & Params & FLOPs & GPU Throughput & GPU Throughput \\
\rowcolor[HTML]{EFEFEF}
\multirow{-2}{*}{Model} & Acc. & (M) & (G) & (img/s, res=224) & (img/s, res=768) \\ \hline
RegNetY-800MF~\citep{radosavovic2020designing} & 76.3 & 6 & 0.8 & 1642.2 & 145.0 \\
RegNetY-1.6G~\citep{radosavovic2020designing} & 78.0 & 11 & 1.6 & 932.0 & 103.1 \\
DeiT-Ti~\citep{touvron2020training} & 72.2 & 5 & 1.3 & 2390.3 & 70.4 \\
\rowcolor[HTML]{EFEFEF}
ViBid-U & 76.3 & 6 & 1.0 & 1177.7 & 163.8 \\
\rowcolor[HTML]{EFEFEF}
ViBid-T & 78.8 & 10 & 1.9 & 650.3 & 94.4 \\ \hline

ResNet-50~\citep{he2016deep} & 75.3 & 26 & 3.8 & 1097.0 & 104.7 \\
RegNetY-4G~\citep{radosavovic2020designing} & 80.0 & 21 & 4.0 & 837.2 & 33.9 \\
DeiT-S~\citep{touvron2020training} & 79.8 & 22 & 4.6 & 892.5 & 31.7 \\
Swin-T~\citep{liu2021swin} & 81.3 & 29 & 4.5 & 729.0 & - \\
XCiT-S12/16~\citep{el2021xcit} & 82.0 & 26 & 4.8 & 678.4 & 52.8 \\
CoAtNet-0~\citep{dai2021coatnet} & 81.6 & 25 & 4.2 & - & - \\
PVTv2-B2~\citep{wang2022pvt} & 82.0 & 25 & 4.0 & - & - \\
MViTv2-T~\citep{li2022mvitv2} & 82.3 & 24 & 4.7 & - & - \\
\rowcolor[HTML]{EFEFEF}
ViBid-S & 82.0 & 21 & 3.7 & 832.2 & 61.6 \\ \hline

ResNet-101~\citep{he2016deep} & 76.4 & 47 & 7.6 & 657.1 & 63.1 \\
RegNetY-8G~\citep{radosavovic2020designing} & 81.7 & 39 & 8.0 & 477.5 & 27.8 \\
Swin-S~\citep{liu2021swin} & 83.0 & 50 & 8.7 & 409.1 & - \\
XCiT-S24/16~\citep{el2021xcit} & 82.6  & 48 & 9.1 & 369.3 & 31.4 \\
\rowcolor[HTML]{EFEFEF}
ViBid-M & 82.8 & 37 & 7.0 & 465.8 & 38.6 \\ \hline

RegNetY-16G~\citep{radosavovic2020designing} & 82.9 & 84 & 16.0 & 317.3 & 16.3 \\
DeiT-B~\citep{touvron2020training} & 81.8 & 86 & 17.5 & 303.4 & 13.1 \\
Swin-B~\citep{liu2021swin} & 83.5 & 88 & 15.4 & 274.6 & 21.3 \\
XCiT-M24/16~\citep{el2021xcit} & 82.9 & 84 & 16.2 & 249.0 & 21.5 \\
\rowcolor[HTML]{EFEFEF}
ViBid-B & 83.3 & 64 & 11.9 & 330.8 & 28.6 \\ \hline
\end{tabular}
\label{tb:imagenet_comparison}
\end{table*}

\paragraph{Comparison with the concurrent models.}
In Table~\ref{tb:imagenet_comparison}, we compare our models with existing transformer-based and CNN models. Our models achieved higher accuracy than the other models at a similar scale of FLOPs and parameters (see Figure~\ref{fig:flops_params_acc_comparison}). They perform well with fewer computational resources and lower capacity, even though they do not use vision-specific architectural optimizations such as local self-attention or multi-scale structures. As a measure of speed, we report the GPU throughput of each model. In particular, at a resolution of $768\times 768$, the GPU throughput of our models surpassed that of the CNN models as well as the other transformer-based models. Although our proposed method computes global spatial relations, our models show superior performance at various resolutions compared with CNN models that rely on local relations.


\paragraph{Fine-tuning at higher resolution.} Instead of training the models from scratch, we fine-tuned ViBid-M and ViBid-B at higher resolutions for 10 epochs. We report the results of fine-tuning at resolutions of 384, 512, and 768, with batch sizes of 1024, 512, and 256, respectively. Owing to their lower memory consumption, our models can be trained faster with a large batch size, even when the available computational resources are limited.


Our models showed higher performance than DeiT-B and Swin-B at the same resolution. In addition, the GPU throughput of our models dropped less at higher resolutions than that of the other models. Moreover, our models allocated much less memory. DeiT-B trained at a resolution of $384\times 384$ consumes 20\% more memory than ViBid-B trained at a resolution of $512\times 512$, even though DeiT-B uses half the number of tokens that ViBid-B uses. As the resolution increases, fine-tuning improves accuracy without increasing the model capacity. This suggests that the model learns high-resolution features without additional parameters.
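
As a rough check of the token counts involved, assume that both models split the image into non-overlapping $P\times P$ patches with $P=16$ (the standard setting for DeiT-B; the same value is assumed for ViBid-B purely for illustration). The number of tokens at resolution $H\times H$ is then $N=(H/P)^2$:
\begin{align*}
N_{384} &= (384/16)^2 = 24^2 = 576, \\
N_{512} &= (512/16)^2 = 32^2 = 1024, \\
N_{384}/N_{512} &= 576/1024 \approx 0.56 .
\end{align*}
Under this assumption, an explicit $N\times N$ attention map would hold $576^2 = 331{,}776$ entries per head at $384\times 384$ and $1024^2 = 1{,}048{,}576$ entries per head at $512\times 512$, which is the kind of growth that a linear-complexity attention avoids.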

\subsection{Ablation Study}
\paragraph{The effect of the convolutional layers.} To evaluate BiNorm in a pure architecture, we removed the depth-wise convolutional layer, i.e., the Local Patch Interaction (LPI) block proposed in XCiT~\citep{el2021xcit}. We compared ViBid-S (w/o conv) with DeiT-S~\citep{touvron2020training} on the ImageNet1k dataset because, without this layer, our architecture is essentially the same as DeiT except for BiNorm. ViBid-S (w/o conv) reached 80.3\% top-1 accuracy, outperforming DeiT-S at 79.8\%. Adding LPI improves performance further. Nevertheless, these results indicate that BiNorm itself is effective: ViBid outperforms DeiT without LPI, and it performs at least as well as XCiT when LPI is present (Table~\ref{tb:imagenet_comparison}).

\paragraph{Comparison of the activation functions.} As mentioned in Sec.~\hyperref[method]{3}, we compared a Softmax-free ViT, the original ViT~\citep{dosovitskiy2020image}, and ViBid-S without LPI to examine the main role of the Softmax function. The results are shown in Table~\ref{tb:ablation_softmax_free}. In our tests, the Softmax-free ViT model performed slightly worse than the original ViT model. Although the gap is smaller than we had anticipated, the Softmax-free models did not converge well in the early epochs (see the supplementary material).

To confirm the advantages of BiNorm, we trained ViBid-S with and without BiNorm using the same training techniques. Our model with BiNorm performs better than both the Softmax-free and the Softmax models. These findings empirically support both our theory regarding the primary function of Softmax in the original self-attention and the necessity of BiNorm as a replacement for Softmax.



\subsection{Measuring Computational Efficiency}
As noted in Section~\hyperref[method]{3}, BiNorm-based self-attention can be a useful solution for transformer-based vision models when high-resolution features are required. For a quantitative analysis, we report the computational resources required at various resolutions by different vision models and our proposed models (see Figure~\ref{fig:comparison_consumption_hires}). All measurements were performed on an NVIDIA V100 GPU with batch size $b=32$.

\paragraph{Memory efficiency.} Memory efficiency is one of the most important factors for both training and inference. As shown in Figure~\ref{fig:memory_consumption}, the models based on BiNorm consumed much less memory at larger resolutions than the other models, which are based on the original self-attention or local attention. This demonstrates that our BiNorm-based self-attention scheme handles high-resolution inputs more efficiently, even compared with local attention algorithms such as Swin~\citep{liu2021swin}. Our model can process up to a $4\times$ larger batch size than the other models showing similar performance. Another advantage of the proposed method is that our models can be scaled up easily without concern for the growth of memory usage. As depicted in Figure~\ref{fig:memory_consumption}, the allocated memory of our models does not increase much as the model size grows. This allows training at a large scale with reasonable computational resources.


\begin{table}[t]
\centering
\caption{\textbf{Ablation study on the effect of Softmax.} For a fair comparison, we re-implemented and retrained the ViT models. Note that ViT-B without Softmax performs the matrix multiplications of self-attention sequentially.}
\begin{tabular}{c|c|c|c}
\hline
\rowcolor[HTML]{EFEFEF}
 & & Activation & Top-1 \\
\rowcolor[HTML]{EFEFEF}
\multirow{-2}{*}{Model} & \multirow{-2}{*}{Complexity} & Function & Acc. \\
\hline
ViT-B & $O(N)$ & None & 78.6 \\
ViT-B & $O(N^2)$ & Softmax & 78.8 \\
ViBid-S (w/o conv) & $O(N)$ & None & 79.1 \\
ViBid-S (w/o conv) & $O(N)$ & BiNorm & \textbf{80.3} \\
\hline
\end{tabular}
\label{tb:ablation_softmax_free}
\end{table}


\begin{table}[t]
\centering
\caption{\textbf{Comparison of the linear attention algorithms in terms of ImageNet1k top-1 accuracy.} To compare ImageNet1k top-1 accuracy, we re-implemented the LinFormer and Efficient Attention algorithms on the ViT-S design. ViBid-S without LPI achieves higher accuracy than the other algorithms.}
\begin{tabular}{c|c|c}
\hline
\rowcolor[HTML]{EFEFEF}
& & Top-1 \\
\rowcolor[HTML]{EFEFEF}
\multirow{-2}{*}{Model} & \multirow{-2}{*}{GPU Thr.} & Acc. \\
\hline
ViBid-S (w/o conv) & \textbf{547.3} & \textbf{80.3} \\
LinFormer & 403.9 & 75.7 \\
Efficient Attention & 416.1 & 76.3 \\
\hline
\end{tabular}
\label{tb:comparison_with_lsa_acc}
\end{table}

\paragraph{GPU throughput.} Figure~\ref{fig:resolution_throughput_comparison} reports the GPU throughput of transformer-based models at various resolutions. As shown, our model is faster than other models with similar performance. In addition, the GPU throughput of our model decreases more slowly than that of other models as the input resolution increases. This is because our proposed algorithm has $O(N)$ complexity and does not require additional kernel optimizations in the frameworks.
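
To make this scaling concrete, the ratio of the throughput at $768\times 768$ to that at $224\times 224$ can be computed from the B-scale rows of Table~\ref{tb:imagenet_comparison}. This is only a back-of-the-envelope reading of the reported numbers, not an additional measurement:
\begin{align*}
\text{ViBid-B:}\quad & 28.6/330.8 \approx 0.086, \\
\text{XCiT-M24/16:}\quad & 21.5/249.0 \approx 0.086, \\
\text{Swin-B:}\quad & 21.3/274.6 \approx 0.078, \\
\text{RegNetY-16G:}\quad & 16.3/317.3 \approx 0.051, \\
\text{DeiT-B:}\quad & 13.1/303.4 \approx 0.043 .
\end{align*}
For reference, the pixel count, and hence the number of tokens for a fixed patch size, grows by a factor of $(768/224)^2 \approx 11.8$. A model whose throughput scales inversely with the number of tokens would therefore retain about $1/11.8 \approx 0.085$ of its throughput. The ViBid-B ratio is close to this value, which is consistent with $O(N)$ behavior.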

\paragraph{Comparison with the existing linear attention algorithms.} To compare our method with the linear self-attention algorithms introduced in Section~\hyperref[related_works]{2}, we implemented these algorithms~\citep{wang2020linformer, qin2022cosformer, shen2021efficient} within the ViT design. Since none of the other algorithms uses additional layers, we removed the convolutional layers from our models to run the experiments under equal conditions. Our proposed method achieves the highest GPU throughput and the lowest memory consumption at every input resolution, and outperforms the other algorithms on the ImageNet1k classification task. Detailed results are given in Tables~\ref{tb:comparison_with_lsa_acc} and~\ref{tb:comparison_with_lsa_compute}.


\begin{table}[t]
\centering
\caption{\textbf{Computational efficiency of the linear attention algorithms.} All networks are implemented on the same architecture design. ViBid-M shows the highest GPU throughput and the lowest GPU memory consumption. All measurements were performed on a single NVIDIA V100 GPU.}
\begin{subtable}[h]{\linewidth}
\begin{tabular}{c|c|c|c|c}
\hline
\rowcolor[HTML]{EFEFEF}
& ViBid-M & Cos- & Lin- & Efficient \\
\rowcolor[HTML]{EFEFEF}
\multirow{-2}{*}{Res.} & (w/o conv) & Former & Former & Attention \\
\hline
224 & \textbf{547.3} & 384.9 & 403.9 & 416.1 \\
384 & \textbf{209.9} & 147.8 & 151.7 & 186.3 \\
512 & \textbf{119.3} & 84.7 & 85.9 & 114.5 \\
1024 & \textbf{24.2} & 18.4 & 20.1 & 23.9 \\
\hline
\end{tabular}
\centering
\caption{GPU Throughput (img/s, batch size $b=32$)}
\end{subtable}
\begin{subtable}[h]{\linewidth}
\begin{tabular}{c|c|c|c|c}
\hline
\rowcolor[HTML]{EFEFEF}
& ViBid-M & Cos- & Lin- & Efficient \\
\rowcolor[HTML]{EFEFEF}
\multirow{-2}{*}{Res.} & (w/o conv) & Former & Former & Attention \\
\hline
224 & \textbf{0.38} & 0.41 & 0.42 & 0.38 \\
384 & \textbf{0.65} & 0.69 & 0.70 & 0.65 \\
512 & \textbf{1.05} & 1.08 & 1.11 & 1.05 \\
1024 & \textbf{3.77} & 3.81 & 3.91 & 3.77 \\
\hline
\end{tabular}
\centering
\caption{GPU Memory Allocation (GB)}
\end{subtable}
\label{tb:comparison_with_lsa_compute}
\end{table}


\section{Discussion and Conclusion}
\label{conclusion}
We propose a simple transformer architecture that reduces the time and memory complexity of self-attention from $O(N^2)$ to $O(N)$. The complexity of a general SA transformer algorithm is $O(N^2)$, i.e., it scales quadratically with the number of tokens $N$, which grows with the input resolution. In contrast, by applying BiNorm and multiplying $K^TV$ first and then $Q$, the proposed algorithm is designed to be $O(N)$, which allows the number of model parameters to be lowered considerably and large input resolutions to be handled. Furthermore, previous transformer techniques relied on complex architectures, such as window attention, kernel-based attention, and pattern-based attention, to compensate for reduced performance; ViBid with BiNorm requires none of these, yet performs comparably to the previous algorithms. We expect our proposed algorithm to be applicable to any transformer algorithm that uses $QKV$ attention, because it requires only very small code modifications.
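
The complexity argument can be sketched by counting multiply-accumulate operations for the two multiplication orders. As an illustrative assumption, let $Q, K, V \in \mathbb{R}^{N\times d}$ for a single head, where $d$ denotes the head dimension (a symbol introduced here only for this sketch), and ignore the cost of the normalization itself:
\begin{align*}
(QK^{T})V &: \quad N^2 d + N^2 d = 2N^2 d, \\
Q(K^{T}V) &: \quad N d^2 + N d^2 = 2N d^2 .
\end{align*}
The first order materializes an $N\times N$ intermediate matrix, whereas the second only needs a $d\times d$ one. For example, with $N=1024$ and $d=64$, the two counts are $2\cdot 1024^2\cdot 64 \approx 1.3\times 10^{8}$ and $2\cdot 1024\cdot 64^2 \approx 8.4\times 10^{6}$, a difference of a factor of $N/d = 16$ that keeps growing with the resolution.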

\paragraph{Limitation.} Our proposed algorithm can be used for general vision tasks, such as image classification, object detection, and segmentation. However, the architecture needs further optimization for each task, so we do not report the performance of our models on object detection or segmentation. In the future, we intend to perform experiments with commonly used architectural optimizations, such as multi-scale structures or compound scaling, to reach SoTA-level performance on those tasks.

% \newpage

% \section{Supplementary Material}

%%%%%%%%% REFERENCES
\bibliography{song_11}

\end{document}
