Keywords: Vision Transformer, Linear Transformer, Bidirectional Normalization, Image Classification
TL;DR: Through BiNorm experiments, we empirically demonstrate the shortcomings of softmax-free attention and the significance of softmax in attention. BiNorm is the simplest adaptation of current algorithms that change the order of matrix multiplication.
Abstract: Vision transformers have achieved state-of-the-art performance in various vision tasks; however, their memory consumption is larger than that of earlier convolutional neural network based models because general self-attention has O(N^2) time and memory complexity. Many approaches reduce the complexity to O(N), but to compensate for the resulting performance degradation they either stack deep convolutional layers to retain locality or complicate the architecture, as in window attention. To address these problems, we propose the ViBid algorithm, which resolves the O(N^2) complexity by replacing softmax with bidirectional normalization (BiNorm). ViBid also has a much simpler architecture than existing transformer models with O(N) complexity. Owing to this simple architecture, we were able to train at larger resolutions, and we obtained a lighter model with superior GPU throughput and competitive performance. Because of BiNorm, ViBid can be used with any transformer method that uses queries, keys, and values (QKV), and its simple architecture makes it broadly applicable.
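The complexity argument in the abstract rests on a general property of softmax-free attention: once the row-wise softmax is removed, the product (QK^T)V can be regrouped as Q(K^T V), so the N x N score matrix is never formed. The sketch below illustrates only that regrouping in plain Python. The function `l2_normalize_rows` is a hypothetical stand-in normalization chosen for illustration; it is not the paper's BiNorm, whose exact definition is not given in the abstract. The sizes `N` and `d` and all helper names are likewise assumptions.

```python
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]


def transpose(m):
    """Return the transpose of a matrix stored as a list of rows."""
    return [list(col) for col in zip(*m)]


def l2_normalize_rows(m, eps=1e-12):
    """Hypothetical stand-in normalization (row-wise L2); NOT the paper's BiNorm."""
    normalized = []
    for row in m:
        norm = sum(x * x for x in row) ** 0.5
        normalized.append([x / (norm + eps) for x in row])
    return normalized


def attention_quadratic(q, k, v):
    """Softmax-free attention as (Q K^T) V: forms an N x N score matrix."""
    scores = matmul(q, transpose(k))  # shape N x N, cost grows as N^2
    return matmul(scores, v)


def attention_linear(q, k, v):
    """Same product regrouped as Q (K^T V): forms only a d x d matrix."""
    context = matmul(transpose(k), v)  # shape d x d, independent of N
    return matmul(q, context)


def random_matrix(rows, cols):
    """Build a rows x cols matrix with entries drawn uniformly from [-1, 1]."""
    return [[random.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


random.seed(0)
N, d = 8, 4  # illustrative token count and head dimension (assumed values)

# Normalize Q and K independently, so no step needs the full N x N matrix.
q = l2_normalize_rows(random_matrix(N, d))
k = l2_normalize_rows(random_matrix(N, d))
v = random_matrix(N, d)

out_quadratic = attention_quadratic(q, k, v)
out_linear = attention_linear(q, k, v)

# The two groupings agree up to floating-point rounding.
max_diff = max(
    abs(a - b)
    for row_a, row_b in zip(out_quadratic, out_linear)
    for a, b in zip(row_a, row_b)
)
print(f"max abs difference between groupings: {max_diff:.2e}")
```

With softmax in place this regrouping is not available, because softmax must be applied to each full row of QK^T before multiplying by V. That is why methods in this family replace it with a normalization that can be applied without the N x N matrix.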
Supplementary Material: pdf
Other Supplementary Material: zip