LinCNNFormer: Hybrid Linear Vision Transformer Based on Convolution Neural Network and Linear Attention

Published: 2024 · Last Modified: 06 Nov 2025 · DICTA 2024 · CC BY-SA 4.0
Abstract: In the dynamic field of computer vision, vision transformers have emerged as a powerful tool, yet they face significant computational cost, primarily due to the quadratic complexity of their self-attention mechanism. To address this problem, recent work has explored linear attention mechanisms that replace the traditional softmax operation with activation or mapping functions to compute query-key similarities. However, these approaches often fall short of the performance of their softmax-based counterparts. To address this gap, we introduce LinCNNFormer, a novel hybrid linear vision transformer that merges the inductive bias of Convolutional Neural Networks (CNNs) with the global modeling capability of linear attention, specifically ReLU attention. This combination harnesses the strengths of both components and substantially boosts performance. A further innovation of our design is to split the feature maps into two groups along the channel dimension, dedicating one branch to convolution and the other to linear attention, before concatenating the two branches back along the channel dimension. This approach effectively curbs the potential increase in computational overhead. We conduct experiments on three benchmark datasets: CIFAR-10, CIFAR-100, and Tiny-ImageNet. The results demonstrate that LinCNNFormer sets a new benchmark in classification accuracy with fewer parameters, surpassing existing methods.
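To make the two ideas in the abstract concrete, below is a minimal PyTorch sketch of (a) a ReLU-based linear attention layer, which replaces the softmax similarity with non-negative ReLU feature maps so attention cost scales linearly in the number of tokens, and (b) a block that splits channels between a convolution branch and a linear-attention branch before concatenating them. This is an illustrative reconstruction under assumed design choices, not the authors' released implementation: the class names (`ReLULinearAttention`, `HybridSplitBlock`), the depthwise-convolution branch, the normalization, and the head count are all assumptions made for the example.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class ReLULinearAttention(nn.Module):
    """Linear attention with ReLU feature maps (illustrative sketch).

    Computes out = ReLU(Q) (ReLU(K)^T V) / normalizer, i.e. O(N * d^2)
    instead of the O(N^2 * d) cost of softmax attention."""

    def __init__(self, dim, num_heads=4, eps=1e-6):
        super().__init__()
        assert dim % num_heads == 0
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        self.eps = eps
        self.qkv = nn.Linear(dim, dim * 3, bias=False)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x):                              # x: (B, N, C)
        B, N, C = x.shape
        qkv = self.qkv(x).reshape(B, N, 3, self.num_heads, self.head_dim)
        q, k, v = qkv.permute(2, 0, 3, 1, 4)           # each: (B, heads, N, d)
        q, k = F.relu(q), F.relu(k)                    # non-negative feature maps
        kv = torch.einsum("bhnd,bhne->bhde", k, v)     # sum_n k_n v_n^T: (B, heads, d, d)
        z = 1.0 / (torch.einsum("bhnd,bhd->bhn", q, k.sum(dim=2)) + self.eps)
        out = torch.einsum("bhnd,bhde,bhn->bhne", q, kv, z)
        out = out.transpose(1, 2).reshape(B, N, C)
        return self.proj(out)


class HybridSplitBlock(nn.Module):
    """Splits channels into a convolution branch and a ReLU linear-attention
    branch, then concatenates the two halves back along the channel axis."""

    def __init__(self, dim, num_heads=4):
        super().__init__()
        assert dim % 2 == 0
        half = dim // 2
        self.conv_branch = nn.Sequential(              # local inductive bias (assumed depthwise conv)
            nn.Conv2d(half, half, kernel_size=3, padding=1, groups=half),
            nn.BatchNorm2d(half),
            nn.GELU(),
        )
        self.attn_branch = ReLULinearAttention(half, num_heads=num_heads)

    def forward(self, x):                              # x: (B, C, H, W)
        B, C, H, W = x.shape
        x_conv, x_attn = torch.chunk(x, 2, dim=1)      # split along channels
        x_conv = self.conv_branch(x_conv)
        tokens = x_attn.flatten(2).transpose(1, 2)     # (B, H*W, C/2) token sequence
        tokens = self.attn_branch(tokens)
        x_attn = tokens.transpose(1, 2).reshape(B, C // 2, H, W)
        return torch.cat([x_conv, x_attn], dim=1)      # re-unify channels


if __name__ == "__main__":
    block = HybridSplitBlock(dim=64, num_heads=4)
    y = block(torch.randn(2, 64, 32, 32))
    print(y.shape)  # torch.Size([2, 64, 32, 32])
```

Because attention is applied to only half of the channels and the convolution branch is cheap, the extra cost of the hybrid design stays modest, which is the overhead-control argument made in the abstract.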