Abstract: Linear attention has been proposed as an efficient alternative to softmax attention, particularly for language modeling. Motivated by its lower computational complexity, many works have also applied linear attention to vision tasks. In this paper, we focus on simple isotropic architectures and investigate two key design aspects of extending linear attention to visual data: scanning methods and hybrid architectures that combine linear and softmax attention. We study a series of scanning methods, and our empirical results suggest that the scanning strategy itself provides limited benefit. In contrast, hybrid models yield promising results compared with pure linear attention models. Focusing on the hybrid design, we further investigate several types of softmax attention suitable for integration and find that the tiled version of high-order sliding window attention (HSWA) is efficient in both theory and practice. We name the resulting architecture, a simple combination of linear attention and HSWA, VBased, and conduct additional experiments to evaluate its effectiveness. With performance comparable to that of Transformers, matching their efficiency on moderate-length sequences and surpassing it on long sequences, VBased offers a promising path for the adoption of linear attention in vision and can serve as a simple baseline for future architectural research.
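As a rough illustration of the hybrid design the abstract describes (not the paper's actual implementation), the following minimal PyTorch sketch interleaves a non-causal linear-attention layer with a local softmax-attention layer over flattened patch tokens. The elu+1 feature map, the 1D band mask standing in for tiled HSWA, the module names (LinearAttention, SlidingWindowAttention, HybridBlock), and all hyperparameters are assumptions chosen for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


def feature_map(x):
    # A simple positive feature map; the paper's kernel may differ.
    return F.elu(x) + 1.0


class LinearAttention(nn.Module):
    """Non-causal linear attention: O(N d^2) via the (phi(K)^T V) trick."""
    def __init__(self, dim, heads=8):
        super().__init__()
        self.heads = heads
        self.qkv = nn.Linear(dim, dim * 3, bias=False)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x):                                   # x: (B, N, D)
        B, N, D = x.shape
        h = self.heads
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        q, k, v = (t.view(B, N, h, D // h).transpose(1, 2) for t in (q, k, v))
        q, k = feature_map(q), feature_map(k)
        # Aggregate keys/values once, then read out per query.
        kv = torch.einsum("bhnd,bhne->bhde", k, v)          # (B, h, d, d)
        z = 1.0 / (torch.einsum("bhnd,bhd->bhn", q, k.sum(dim=2)) + 1e-6)
        out = torch.einsum("bhnd,bhde,bhn->bhne", q, kv, z)
        return self.proj(out.transpose(1, 2).reshape(B, N, D))


class SlidingWindowAttention(nn.Module):
    """Softmax attention restricted to a local band of tokens.
    A 1D band over the flattened sequence is used here as a stand-in
    for the paper's tiled HSWA, whose exact windowing is not reproduced."""
    def __init__(self, dim, heads=8, window=16):
        super().__init__()
        self.heads, self.window = heads, window
        self.qkv = nn.Linear(dim, dim * 3, bias=False)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x):
        B, N, D = x.shape
        h = self.heads
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        q, k, v = (t.view(B, N, h, D // h).transpose(1, 2) for t in (q, k, v))
        idx = torch.arange(N, device=x.device)
        band = (idx[None, :] - idx[:, None]).abs() <= self.window  # True = attend
        out = F.scaled_dot_product_attention(q, k, v, attn_mask=band)
        return self.proj(out.transpose(1, 2).reshape(B, N, D))


class HybridBlock(nn.Module):
    """One global linear-attention sublayer followed by one local softmax
    sublayer, each with a pre-norm residual; MLPs omitted for brevity."""
    def __init__(self, dim, heads=8, window=16):
        super().__init__()
        self.n1, self.n2 = nn.LayerNorm(dim), nn.LayerNorm(dim)
        self.lin = LinearAttention(dim, heads)
        self.swa = SlidingWindowAttention(dim, heads, window)

    def forward(self, x):
        x = x + self.lin(self.n1(x))
        x = x + self.swa(self.n2(x))
        return x


# Usage on a batch of 14x14 = 196 patch tokens with dim 256:
tokens = torch.randn(2, 196, 256)
out = HybridBlock(dim=256)(tokens)                          # (2, 196, 256)
```

The intuition behind this layering is that the linear-attention sublayer captures cheap global context while the windowed softmax sublayer restores precise local interactions, which is the division of labor the abstract attributes to the hybrid design.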
Submission Length: Long submission (more than 12 pages of main content)
Assigned Action Editor: ~Meisam_Razaviyayn1
Submission Number: 6131