Keywords: vision transformer, long sequence modeling
Abstract: Vision transformers, which tokenize an image and introduce an attention mechanism to learn cross-token relationships, have advanced many computer vision tasks. However, the attention module has quadratic computational complexity and hence suffers from slow computation and high memory cost, hindering it from handling long sequences of tokens. Some attempts optimize the quadratic attention with linear approximations yet observe an undesired performance drop. This work balances the trade-off between the modeling efficiency and capacity of visual attention. We notice that, by treating queries and keys as nodes in a graph, existing algorithms are akin to modeling one-step interactions between nodes. To strengthen the cross-node connections for a more representative attention, we introduce multi-step interaction, which is equivalent to solving a matrix inverse as in the random walk graph kernel. We then come up with a new strategy to construct queries and keys, with the help of a bipartite graph, to ease the calculation of the matrix inversion. The effectiveness of our approach is verified on various visual tasks. We also make it possible to learn a vision transformer with extremely long sequences of tokens. We achieve competitive results on the semantic segmentation task with 15% fewer parameters and 10-25% less computation. In addition, the vision-transformer-based quantization method can be applied to 512x512 or even 1024x1024 resolution images. Code will be made publicly available.
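A minimal sketch of the multi-step interaction idea mentioned in the abstract (the symbols $A$, $\gamma$, and the normalization here are illustrative assumptions, not necessarily the paper's exact construction): with an affinity matrix $A = QK^\top$ over the token graph, one-step attention uses $A$ directly, whereas summing discounted walks of all lengths yields the matrix inverse familiar from the random walk graph kernel,

$$\sum_{k=0}^{\infty} \gamma^k A^k = (I - \gamma A)^{-1}, \qquad 0 < \gamma < 1/\rho(A),$$

where $\rho(A)$ is the spectral radius of $A$. The bipartite construction of queries and keys described in the abstract is presumably what keeps this inversion tractable for long token sequences.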
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics
Submission Guidelines: Yes
Please Choose The Closest Area That Your Submission Falls Into: Applications (eg, speech processing, computer vision, NLP)
TL;DR: We propose a novel linear attention mechanism based on the random walk graph kernel, which can be widely used in vision transformers with long sequence inputs.