Keywords: vision transformer, long sequence modeling
Abstract: Vision transformers, which tokenize an image and introduce an attention mechanism to learn cross-token relationships, have advanced many computer vision tasks. However, the attention module has quadratic computational complexity and hence suffers from slow computation and high memory cost, hindering it from handling long sequences of tokens. Some attempts optimize the quadratic attention with linear approximations yet observe an undesired performance drop. This work balances the trade-off between the modeling efficiency and capacity of visual attention. We notice that, by treating queries and keys as nodes in a graph, existing algorithms are akin to modeling one-step interactions between nodes. To strengthen the cross-node connections for a more representative attention, we introduce multi-step interaction, which is equivalent to solving a matrix inverse as in the random walk graph kernel. We then come up with a new strategy to construct queries and keys, with the help of a bipartite graph, to ease the calculation of the matrix inversion. The effectiveness of our approach is verified on various visual tasks. We also make it possible to learn a vision transformer with extremely long sequences of tokens. We achieve competitive results on the semantic segmentation task with 15% fewer parameters and 10-25% less computation. In addition, the vision-transformer-based quantization method can be applied to 512x512 or even 1024x1024 resolution images. Code will be made publicly available.
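A minimal sketch of the multi-step interaction idea mentioned in the abstract (the symbols $A$, $\gamma$, and the normalization here are illustrative assumptions, not necessarily the paper's exact construction): with an affinity matrix $A = QK^\top$ over the token graph, one-step attention uses $A$ directly, whereas summing discounted walks of all lengths yields the matrix inverse familiar from the random walk graph kernel,

$$\sum_{k=0}^{\infty} \gamma^k A^k = (I - \gamma A)^{-1}, \qquad 0 < \gamma < 1/\rho(A),$$

where $\rho(A)$ is the spectral radius of $A$. The bipartite construction of queries and keys described in the abstract is presumably what keeps this inversion tractable for long token sequences.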
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics
Submission Guidelines: Yes
Please Choose The Closest Area That Your Submission Falls Into: Applications (eg, speech processing, computer vision, NLP)
TL;DR: We propose a novel linear attention mechanism based on the random walk graph kernel, which can be widely used in vision transformers with long sequence inputs.