The high computational cost of vision transformers hinders their deployment on resource-limited devices such as mobile phones. Reducing the number of tokens can significantly accelerate inference and save computational resources. Most existing token pruning methods focus on evaluating the importance of each token and directly discard the unimportant ones, which incurs significant information loss. A few methods focus on merging instead, but they directly partition tokens into two parts by random or odd/even assignment, without carefully considering which tokens to select. In this paper, we propose a new token condensation method based on the connectivity between tokens. Unlike previous methods, we gradually condense the large set of tokens through selection and fusion: the most representative tokens are selected, and each of the remaining tokens is fused into one of them. Extensive experiments are conducted on benchmark datasets. Compared with existing methods, our method achieves higher accuracy at lower computational cost. For example, it reduces the FLOPs of DeiT-S by 50% without accuracy degradation on the ImageNet dataset.
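The selection-and-fusion idea can be illustrated with a minimal sketch. The abstract does not specify how connectivity is computed or how fusion is performed, so the snippet below makes hedged assumptions: connectivity is modeled as cosine similarity between token embeddings, representatives are the top-k tokens by total connectivity, and fusion is plain averaging of each non-selected token into its most connected representative. The function `condense_tokens` and all of these choices are hypothetical illustrations, not the paper's actual method.

```python
import torch
import torch.nn.functional as F

def condense_tokens(x: torch.Tensor, keep: int) -> torch.Tensor:
    """Condense a token sequence (B, N, D) down to `keep` tokens.

    Sketch only: connectivity = cosine similarity between tokens,
    selection = top-`keep` tokens by total connectivity, and fusion =
    averaging each remaining token into its closest representative.
    """
    B, N, D = x.shape
    normed = F.normalize(x, dim=-1)
    # Pairwise connectivity between all tokens: (B, N, N).
    sim = normed @ normed.transpose(1, 2)

    # Score each token by its total connectivity to all other tokens
    # and select the `keep` most representative tokens per sample.
    scores = sim.sum(dim=-1)                      # (B, N)
    keep_idx = scores.topk(keep, dim=-1).indices  # (B, keep)

    out = []
    for b in range(B):
        kept = keep_idx[b]                               # (keep,)
        mask = torch.ones(N, dtype=torch.bool, device=x.device)
        mask[kept] = False
        rest = mask.nonzero(as_tuple=True)[0]            # (N - keep,)

        # Assign each non-selected token to its most connected representative.
        assign = sim[b][rest][:, kept].argmax(dim=-1)    # (N - keep,)

        # Fuse: average each representative with the tokens assigned to it.
        fused = x[b, kept].clone()
        counts = torch.ones(keep, device=x.device)
        fused.index_add_(0, assign, x[b, rest])
        counts.index_add_(0, assign, torch.ones(len(rest), device=x.device))
        out.append(fused / counts.unsqueeze(-1))
    return torch.stack(out)

# Example: halve 196 patch tokens of a DeiT-S-like model (D = 384).
tokens = torch.randn(2, 196, 384)
print(condense_tokens(tokens, keep=98).shape)  # torch.Size([2, 98, 384])
```

Because every discarded token contributes to some surviving representative, this kind of condensation preserves information that hard pruning would drop, which is the intuition behind the accuracy advantage claimed above.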