ViBid: Linear Vision Transformer with Bidirectional Normalization

Published: 08 May 2023, Last Modified: 26 Jun 2023, UAI 2023
Keywords: Vision Transformer, Linear Transformer, Bidirectional Normalization, Image Classification
TL;DR: Through BiNorm experiments, we empirically demonstrate the shortcomings of softmax-free attention and the significance of softmax in attention. BiNorm is the simplest adaptation of existing algorithms that change the matrix-multiplication order.
Abstract: The vision transformer has achieved state-of-the-art performance in various vision tasks; however, its memory consumption is larger than that of previous convolutional-neural-network-based models because of the O(N^2) time and memory complexity of general self-attention. Many approaches reduce this complexity to O(N); however, they stack deep convolutional layers to retain locality, or complicate the architecture, as in window attention, to compensate for the resulting performance degradation. To solve these problems, we propose the ViBid algorithm, which resolves the O(N^2) complexity problem by replacing softmax with bidirectional normalization (BiNorm). In addition, it has a much simpler architecture than existing transformer models with O(N) complexity. Owing to this simple architecture, we were able to train at larger resolutions and obtained a lighter model with superior GPU throughput and competitive performance. Because BiNorm operates on queries, keys, and values (QKV), ViBid can be used with any transformer method that uses QKV, and its simple architectural structure makes it quite universal.
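The abstract's key idea can be sketched in NumPy: once the softmax over the N x N attention map is replaced by a normalization applied directly to Q and K, the matrix multiplication can be reordered from (QK^T)V to Q(K^T V), reducing the cost from O(N^2 d) to O(N d^2). The specific normalization below (per-query over features for Q, per-feature over tokens for K, hence "bidirectional") is an illustrative assumption, not the paper's exact BiNorm definition.

```python
import numpy as np

def linear_binorm_attention(Q, K, V):
    """Softmax-free attention sketch with a hypothetical bidirectional
    normalization (assumed form, not the paper's exact BiNorm)."""
    # Normalize Q along the feature axis (one direction of the N x N map)
    # and K along the token axis (the other direction).
    Qn = Q / np.abs(Q).sum(axis=-1, keepdims=True)
    Kn = K / np.abs(K).sum(axis=0, keepdims=True)
    # Reordered multiplication: Qn @ (Kn.T @ V) is mathematically identical
    # to (Qn @ Kn.T) @ V but costs O(N * d^2) instead of O(N^2 * d).
    return Qn @ (Kn.T @ V)

# Toy usage: N = 16 tokens, d = 8 channels.
rng = np.random.default_rng(0)
Q, K, V = rng.standard_normal((3, 16, 8))
out = linear_binorm_attention(Q, K, V)
print(out.shape)  # (16, 8)
```

Because no softmax intervenes between Q and K^T, associativity of matrix multiplication is what makes the O(N) reordering valid; this is the common mechanism behind the "matrix multiplication order-changing" algorithms the TL;DR refers to.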
Supplementary Material: pdf
Other Supplementary Material: zip