Abstract: Transformer backbone networks built around the self-attention mechanism have achieved great success in natural language processing and computer vision. However, although the self-attention mechanism brings high performance, it also incurs higher computational complexity than classic visual feature extraction methods. To further reduce the complexity of the self-attention mechanism and explore a lighter version for computer vision, in this paper we design a novel lightweight self-attention mechanism, Low-relation Multi-head Self-Attention (LMSA), which is superior to recent self-attention variants. Specifically, the proposed mechanism breaks the dimensional-consistency constraint of traditional self-attention, resulting in lower computational complexity and a smaller storage footprint. In addition, employing the new mechanism releases part of the Transformer network's computational budget, allowing it to be used more effectively. Experimental results show that the dimensional consistency inside the traditional self-attention mechanism is unnecessary. In particular, with Swin as the backbone model, accuracy on the CIFAR-10 image classification task improves by 0.43$\%$, while the resource consumption of a single self-attention module is reduced by 64.58$\%$, and the number of model parameters and the model size are reduced by more than 15$\%$. By appropriately compressing the dimensions of the self-attention relationship variables, the Transformer network can be more efficient and even perform better. These results prompt us to rethink why the self-attention mechanism works.
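To make the core idea concrete, here is a minimal sketch (not the authors' exact LMSA implementation) of what "compressing the dimensions of the self-attention relationship variables" could look like: the query and key projections use a reduced width `rel_dim` (an assumed name) while the values keep the full embedding width, so the attention relation is computed in a compressed space.

```python
# Hedged sketch of a "low-relation" multi-head self-attention: Q/K are projected
# to a smaller dimension than V, so the relation matrix Q·K^T is computed in a
# compressed space. `rel_dim` and the class name are illustrative assumptions.
import torch
import torch.nn as nn


class LowRelationSelfAttention(nn.Module):
    def __init__(self, embed_dim: int, num_heads: int, rel_dim: int):
        super().__init__()
        assert rel_dim % num_heads == 0 and embed_dim % num_heads == 0
        self.num_heads = num_heads
        self.rel_head = rel_dim // num_heads      # compressed per-head Q/K width
        self.val_head = embed_dim // num_heads    # full per-head V width
        self.q_proj = nn.Linear(embed_dim, rel_dim)    # compressed query
        self.k_proj = nn.Linear(embed_dim, rel_dim)    # compressed key
        self.v_proj = nn.Linear(embed_dim, embed_dim)  # full-width value
        self.out_proj = nn.Linear(embed_dim, embed_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, N, _ = x.shape
        q = self.q_proj(x).view(B, N, self.num_heads, self.rel_head).transpose(1, 2)
        k = self.k_proj(x).view(B, N, self.num_heads, self.rel_head).transpose(1, 2)
        v = self.v_proj(x).view(B, N, self.num_heads, self.val_head).transpose(1, 2)
        # The attention map is still N x N, but it is built from compressed Q/K,
        # which is where the computation and storage savings come from.
        attn = (q @ k.transpose(-2, -1)) * self.rel_head ** -0.5
        attn = attn.softmax(dim=-1)
        out = (attn @ v).transpose(1, 2).reshape(B, N, -1)
        return self.out_proj(out)


if __name__ == "__main__":
    x = torch.randn(2, 49, 96)                        # (batch, tokens, embed_dim)
    attn = LowRelationSelfAttention(embed_dim=96, num_heads=3, rel_dim=48)
    print(attn(x).shape)                              # torch.Size([2, 49, 96])
```

With `rel_dim < embed_dim`, the Q/K projections and the score computation shrink proportionally while the output shape is unchanged, which is consistent with the abstract's claim that dimensional consistency across Q, K, and V is not required.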