Vote & Mix: Plug-and-Play Token Reduction for Efficient Vision Transformer

Published: 2025, Last Modified: 12 Nov 2025ICME 2025EveryoneRevisionsBibTeXCC BY-SA 4.0
Abstract: Despite the remarkable success of Vision Transformers (ViTs) in various visual tasks, they are often hindered by substantial computational cost. In this work, we introduce Vote&Mix (VoMix), a plug-and-play and parameter-free token reduction method, which can be readily applied to off-the-shelf ViT models without any training. VoMix tackles the computational redundancy of ViTs by identifying tokens with high homogeneity through a layer-wise token similarity voting mechanism. Subsequently, the selected tokens are mixed into the retained set, thereby preserving visual information. Experiments demonstrate VoMix significantly improves the speed-accuracy tradeoff of ViTs on both images and videos. Without any training, VoMix achieves a 2× increase in throughput of existing ViT-H on ImageNet-1K and a 2.4× increase in throughput of existing ViT-L on Kinetics-400 video dataset, with a mere 0.3% drop in top-1 accuracy.
Loading