A 0-Shot Self-Attention Mechanism for Accelerated Diagonal Attention

Published: 2025, Last Modified: 09 Nov 2025, WACV 2025, CC BY-SA 4.0
Abstract: The ability of Transformers to process longer sequences has led to unprecedented levels of generalization in visual tasks. However, the complexity of Transformers is dominated by the quadratic cost of computing the attention blocks, a bottleneck that impedes scaling the sequence length and realizing more advanced AI solutions. We propose and explore the hypothesis that the self-attention mechanism exhibits regularities that can be exploited to enhance performance and achieve linear-cost attention without significant loss of effectiveness. Specifically, we investigate the attention matrix of Visual Transformers to identify and leverage these regularities in order to simplify its computation. The resulting procedure significantly reduces the computational cost of Transformers by directly reducing attention block complexity. Moreover, the procedure is 0-shot and self-supervised: it requires no retraining, additional data, or additional parameters, and all Transformer parameters remain unchanged. Consequently, the proposed method can be seamlessly applied to pre-trained Visual Transformers. Experiments on a series of Vision Transformers pre-trained on the ImageNet-1K dataset demonstrate the effectiveness of the proposed approach.
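
The abstract does not spell out the procedure, so the following is not the paper's method. It is only a minimal sketch of the cost argument, assuming the exploited regularity is mass concentrated near the diagonal of the attention matrix. The function `attention`, its `band` parameter, and the toy inputs are hypothetical names introduced for illustration. Full attention scores every query against all N keys (N*N scores), while restricting each query to keys within a fixed distance of its own position costs roughly N*(2*band+1) scores, which is linear in N. No weights are changed in either case.

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of floats.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V, band=None):
    """Single-head attention over lists of vectors (illustrative only).

    band=None: full attention, every query scores all N keys.
    band=w:    query i only scores keys j with |i - j| <= w
               (an assumed diagonal-band restriction, not the paper's rule).
    Returns (outputs, number_of_scores_computed).
    """
    n, d = len(Q), len(Q[0])
    scale = 1.0 / math.sqrt(d)
    outputs, n_scores = [], 0
    for i in range(n):
        if band is None:
            lo, hi = 0, n
        else:
            lo, hi = max(0, i - band), min(n, i + band + 1)
        keys = range(lo, hi)
        scores = [scale * sum(q * k for q, k in zip(Q[i], K[j])) for j in keys]
        n_scores += len(scores)
        weights = softmax(scores)
        outputs.append([
            sum(w * V[j][c] for w, j in zip(weights, keys))
            for c in range(len(V[0]))
        ])
    return outputs, n_scores

# Toy inputs: 8 tokens with 2-dimensional features (arbitrary values).
tokens = [[math.sin(i), math.cos(i)] for i in range(8)]
full_out, full_cost = attention(tokens, tokens, tokens)
band_out, band_cost = attention(tokens, tokens, tokens, band=1)
print(full_cost, band_cost)
```

With a band wide enough to cover the whole sequence, the banded call reduces to full attention, so any approximation error comes only from the scores that are skipped.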