Abstract: Recently, the transformer, an architecture based purely on the attention mechanism, has been applied to a wide range of tasks and has achieved impressive performance. Despite extensive efforts, two drawbacks of the transformer architecture still hinder its wider application: (i) the quadratic complexity introduced by the attention mechanism; (ii) the lack of incorporated inductive bias. In this paper, we present a new hierarchical walking attention, which provides a scalable, flexible, and interpretable sparsification strategy that reduces the complexity from quadratic to linear while noticeably boosting performance. Specifically, we learn a hierarchical structure by splitting an image into regions with different receptive fields. We associate each high-level region with a supernode and inject prior knowledge into this node through supervision. The supernode then acts as an indicator of whether its area should be skipped, so that a large number of unnecessary dot-product terms in the attention can be avoided. Two sparsification phases are finally introduced, allowing the transformer to achieve strictly linear complexity. Extensive experiments demonstrate superior performance and efficiency against state-of-the-art methods. Notably, our method reduces the inference time and the total number of tokens by 28% and 94%, respectively, and brings a 2.6% Rank-1 improvement on MSMT17.
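The following is a minimal sketch, not the authors' implementation, of the supernode-gated skipping idea described in the abstract, assuming a PyTorch setting: region tokens are mean-pooled into supernodes, a learned sigmoid gate on each supernode decides whether its region is kept, and tokens of skipped regions are excluded from the attention. All names and hyperparameters (WalkingAttentionSketch, region_size, keep_threshold) are illustrative assumptions, and the mask-based formulation is only for clarity; the linear complexity claimed in the abstract comes from not computing the skipped dot products at all.

```python
# Illustrative sketch only; module names and hyperparameters are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class WalkingAttentionSketch(nn.Module):
    def __init__(self, dim, region_size=16, keep_threshold=0.5):
        super().__init__()
        self.qkv = nn.Linear(dim, dim * 3)
        self.gate = nn.Linear(dim, 1)          # scores each supernode
        self.region_size = region_size
        self.keep_threshold = keep_threshold
        self.scale = dim ** -0.5

    def forward(self, x):                      # x: (B, N, C), N divisible by region_size
        B, N, C = x.shape
        R = N // self.region_size
        q, k, v = self.qkv(x).chunk(3, dim=-1)

        # Supernode per region: here simply the mean of its token embeddings.
        regions = x.view(B, R, self.region_size, C)
        supernodes = regions.mean(dim=2)                       # (B, R, C)

        # The gate decides which regions are worth fine-grained attention.
        keep = torch.sigmoid(self.gate(supernodes)).squeeze(-1) > self.keep_threshold  # (B, R)
        # Fallback: if every region is dropped, keep them all to avoid an empty key set.
        keep = keep | (~keep.any(dim=1, keepdim=True))

        # Broadcast the region-level decision to a per-token mask; tokens of skipped
        # regions are removed from the key/value set. A real implementation would
        # avoid computing these dot products entirely instead of masking them.
        token_keep = keep.unsqueeze(-1).expand(B, R, self.region_size).reshape(B, N)
        attn_mask = torch.zeros(B, 1, N, device=x.device)
        attn_mask = attn_mask.masked_fill(~token_keep.unsqueeze(1), float("-inf"))

        attn = F.softmax((q @ k.transpose(-2, -1)) * self.scale + attn_mask, dim=-1)
        return attn @ v


# Usage: 64 tokens of dimension 32, grouped into regions of 16 tokens each.
layer = WalkingAttentionSketch(dim=32, region_size=16)
out = layer(torch.randn(2, 64, 32))
print(out.shape)  # torch.Size([2, 64, 32])
```

Mean pooling and a sigmoid gate are stand-ins for the paper's supervised supernodes and two-phase sparsification; the sketch only illustrates how a region-level indicator can prune token-level attention terms.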