Abstract: Following their great success in natural language processing, transformer-based models have emerged as competitive alternatives to convolutional neural networks in computer vision. The Vision Transformer (ViT) and its subsequent variants have exhibited promising performance in tasks such as image classification, object detection, and semantic segmentation. The core of vision transformers is the self-attention mechanism, which models the long-range dependencies between tokens. Conventionally, the attention matrix in self-attention is calculated as the scaled dot-product of the \textit{query} (Q) and \textit{key} (K), so the attention weight depends on the norms of Q and K as well as the angle between them. In this paper, we propose a new attention mechanism named angular self-attention, which replaces the scaled dot-product operation with an angular function in order to model the relationship between tokens more effectively. In particular, we propose two forms of this function, quadratic and cosine, for our angular self-attention. Based on angular self-attention, we design a new vision transformer architecture called the dual-windowed angular vision transformer (\textbf{DWAViT}). DWAViT is a hierarchically structured model characterized by angular self-attention and a new local window mechanism. We evaluate DWAViT on multiple computer vision benchmarks, including image classification on ImageNet-1K, object detection on COCO, and semantic segmentation on ADE20K. Our experimental results suggest that our model achieves promising performance on these tasks while maintaining a computational cost comparable to that of baseline models (e.g., Swin Transformer).
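A minimal sketch of the cosine-form idea follows, assuming an attention logit of the form $a\cos\theta + b$ on the angle $\theta$ between each query-key pair; the function name, the scale/shift parameters `a` and `b`, and the normalization step are illustrative assumptions here, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def cosine_angular_attention(q, k, v, a=1.0, b=0.0):
    # q, k, v: (batch, heads, tokens, dim).
    # Normalizing q and k makes the logits depend only on the
    # query-key angle theta (via cos(theta)), not on token norms;
    # `a` and `b` are illustrative scale/shift parameters, not the
    # paper's exact parameterization.
    q = F.normalize(q, dim=-1)
    k = F.normalize(k, dim=-1)
    cos_theta = q @ k.transpose(-2, -1)   # cos of each query-key angle
    attn = (a * cos_theta + b).softmax(dim=-1)
    return attn @ v

# Toy usage: 1 image, 4 heads, 16 tokens, head dim 32.
q, k, v = (torch.randn(1, 4, 16, 32) for _ in range(3))
out = cosine_angular_attention(q, k, v)
print(out.shape)  # torch.Size([1, 4, 16, 32])
```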
Submission Length: Regular submission (no more than 12 pages of main content)
Previous TMLR Submission Url: https://openreview.net/forum?id=pK6FkQv1Hq
Changes Since Last Submission: 1. In our revised paper, we propose two methods to solve the excessive computational cost of our model. First, in our quadratic self-attention, a linear function is proposed to approximate the $\arccos$ function when obtaining the angles between the queries and keys, which reduces the running time in both the training and inference stages. Our experimental results show that the linear function is sufficient to model the angles between tokens without any loss of performance. Second, since the size of the local window in our model is adjustable, the running time can be reduced considerably by increasing the number of local windows. For instance, in the first stage of our model, the number of (even) local windows is increased from 64 to 100, and in the second stage, the number of (odd) local windows is increased from 9 to 49. With these two methods, our updated model achieves a computational cost comparable to that of the baseline models (e.g., Swin Transformer) in both the training and inference stages; the evaluation can be found in Table 7.
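As an illustration of the $\arccos$ approximation, the sketch below assumes the simple linear interpolant $\hat\theta(x) = \tfrac{\pi}{2}(1 - x)$, which agrees with $\arccos$ at $x = -1, 0, 1$; the paper's actual linear function is not restated here, so this particular choice is hypothetical.

```python
import math
import torch

def approx_angle(cos_theta: torch.Tensor) -> torch.Tensor:
    # Linear stand-in for arccos: theta_hat(x) = (pi/2) * (1 - x).
    # Agrees with arccos at x = -1, 0, 1; the paper's exact linear
    # function may differ -- this interpolant is an assumption.
    return (math.pi / 2) * (1.0 - cos_theta)

# Compare against the exact arccos over the full range of cosine values.
x = torch.linspace(-1.0, 1.0, 10001)
max_err = (torch.arccos(x) - approx_angle(x)).abs().max()
print(f"max |arccos(x) - theta_hat(x)| on [-1, 1]: {max_err:.3f} rad")
```

The point of such a substitution is that a single multiply-add per query-key pair replaces a transcendental call, which is consistent with the reported speedup at both training and inference time.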
2. We perform an ablation study on our proposed dual local window. Specifically, we choose DWAViT-T as the target model and replace our dual local window with the shifted window proposed in the Swin Transformer. The results on different tasks can be found in Tables 6, 13, and 14. Our experimental results suggest that our dual local window achieves better performance than the shifted window on most tasks.
3. We also cite the latest vision transformer papers published at top-tier conferences in 2023 and adopt some of them as new baselines. The results suggest that our model achieves comparable or better performance than these latest models on most tasks.
4. We incorporate the responses from our previous rebuttal into the revised paper to address the previous reviewers' concerns.
Code: https://github.com/DamoSWL/DWAViT
Assigned Action Editor: ~Hongsheng_Li3
Submission Number: 1982