Abstract: Transformers have astounding representational power but typically consume considerable computation and memory. The currently popular Swin transformer reduces computational and memory costs via a local window strategy. However, this inevitably causes two drawbacks: i) the local window-based self-attention weakens the ability to model global dependencies; ii) recent studies point out that local windows impair robustness. This paper proposes a novel defactorization self-attention mechanism (DeSA) that enjoys both the low cost of local windows and the ability to model long-range dependencies. Specifically, we defactorize a large area of feature tokens into non-overlapping subsets and obtain a strictly limited number of key tokens enriched with long-range information through cross-set interaction. Equipped with a new mixed-grained multi-head attention that adjusts the granularity of the key features in different heads, DeSA is capable of modeling long-range dependencies while aggregating multi-grained information at a computational and memory cost equivalent to that of local window-based self-attention. With DeSA, we present a family of models named defactorization vision transformer (DeViT). Extensive experiments show that our DeViT achieves state-of-the-art performance on both classification and downstream tasks, while demonstrating strong robustness to corrupted and biased data. Compared with Swin-T, our DeViT-B2 significantly improves classification accuracy by $1\%$ and robustness by $6\%$, and reduces model parameters by $14\%$. Our code will soon be publicly available at https://github.com/anonymous0519/DeViT.
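To make the defactorization idea concrete, the following is a minimal illustrative sketch, not the authors' DeSA implementation (which additionally includes the mixed-grained multi-head attention described above; see the linked repository). The class name `DeSASketch`, the `subset_size` parameter, and the use of simple mean pooling as a stand-in for cross-set interaction are assumptions made only for illustration: the sketch shows how partitioning tokens into non-overlapping subsets and letting local queries attend to a strictly limited set of pooled key tokens keeps the cost at the local-window level while still exposing long-range information.

```python
# Illustrative sketch only -- not the authors' DeSA. Names such as DeSASketch,
# subset_size, and mean pooling as "cross-set interaction" are assumptions.
import torch
import torch.nn as nn


class DeSASketch(nn.Module):
    """Toy defactorized self-attention: queries stay local per subset, while keys/values
    are a small, fixed set of pooled tokens summarizing every subset (long-range info)."""

    def __init__(self, dim, num_heads=4, subset_size=7):
        super().__init__()
        self.num_heads = num_heads
        self.subset_size = subset_size
        self.scale = (dim // num_heads) ** -0.5
        self.q = nn.Linear(dim, dim)
        self.kv = nn.Linear(dim, 2 * dim)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x):
        # x: (B, H, W, C) feature map; H and W assumed divisible by subset_size.
        B, H, W, C = x.shape
        s = self.subset_size
        # Defactorize the feature map into non-overlapping subsets (windows).
        subsets = x.view(B, H // s, s, W // s, s, C).permute(0, 1, 3, 2, 4, 5)
        subsets = subsets.reshape(B, -1, s * s, C)           # (B, nSub, s*s, C)
        nSub = subsets.shape[1]

        # Cross-set interaction (here: simple mean pooling) yields one key token
        # per subset, so the number of key tokens is strictly limited.
        key_tokens = subsets.mean(dim=2)                      # (B, nSub, C)

        q = self.q(subsets)                                   # (B, nSub, s*s, C)
        k, v = self.kv(key_tokens).chunk(2, dim=-1)           # (B, nSub, C) each

        h, d = self.num_heads, C // self.num_heads
        q = q.view(B, nSub, s * s, h, d).permute(0, 3, 1, 2, 4)  # (B, h, nSub, s*s, d)
        k = k.view(B, nSub, h, d).permute(0, 2, 1, 3)            # (B, h, nSub, d)
        v = v.view(B, nSub, h, d).permute(0, 2, 1, 3)

        # Every local query attends to the pooled keys of *all* subsets.
        attn = torch.einsum('bhnqd,bhmd->bhnqm', q, k) * self.scale
        attn = attn.softmax(dim=-1)
        out = torch.einsum('bhnqm,bhmd->bhnqd', attn, v)         # (B, h, nSub, s*s, d)
        out = out.permute(0, 2, 3, 1, 4).reshape(B, nSub, s * s, C)
        out = self.proj(out)

        # Fold subsets back into the (B, H, W, C) feature map.
        out = out.view(B, H // s, W // s, s, s, C).permute(0, 1, 3, 2, 4, 5)
        return out.reshape(B, H, W, C)


if __name__ == "__main__":
    x = torch.randn(2, 14, 14, 64)
    print(DeSASketch(dim=64, subset_size=7)(x).shape)  # torch.Size([2, 14, 14, 64])
```

Because each query attends to only one pooled token per subset rather than to every token in the image, the attention matrix stays roughly window-sized, which is the cost argument the abstract makes; the paper's actual design varies the key granularity across heads rather than using a single pooling scale.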
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: Yes
Please Choose The Closest Area That Your Submission Falls Into: Deep Learning and representational learning