Keywords: Vision Transformers (ViT), dynamic token pruning, parameter-free saliency, sparse token selection, efficient attention, analytic FLOPs, PASCAL VOC, CIFAR-100, Tiny-ImageNet, LIME explainability, DynamicViT, ToMe
TL;DR: Dyna-ViT prunes tokens before the encoder using a parameter-free saliency score (top-K patches), keeping a standard ViT backbone while delivering ~20–28% faster training with comparable or better accuracy on VOC, CIFAR-100, and Tiny-ImageNet.
Abstract: Vision Transformers (ViTs) achieve state-of-the-art results, yet the quadratic cost of self-attention makes them inefficient, in large part because many low-information background patches are processed redundantly. We introduce Dyna-ViT, a simple, parameter-free framework for dynamic token pruning that ranks patches with an unsupervised saliency proxy and retains only the top-K before the encoder. The backbone remains an unmodified ViT; no extra modules or learnable parameters are added. Across three benchmarks, Dyna-ViT preserves accuracy while reducing compute. On PASCAL VOC, keeping 70% of patches is 25% faster per epoch and improves validation accuracy (97.1%) over the full-token baseline (96.8%). On CIFAR-100, Dyna-ViT attains 91.3% test accuracy versus 92.0% for the baseline with a 28% speed-up. On Tiny-ImageNet, it reaches 81.4% validation accuracy with 20–25% faster training. A simple analytic FLOPs model that scales with sequence length closely matches external estimates (e.g., K=60%, S=119: 10.48 vs. 10.23 GFLOPs), aligning with measured throughput gains. Ablations over K and alternative scoring functions (Sobel, Entropy) confirm robustness, and LIME visualizations show retained tokens align with semantically relevant regions. Under matched token budgets and backbones, Dyna-ViT is competitive with, and sometimes exceeds, learned sparsification (DynamicViT) and in-encoder token merging (ToMe), while introducing no additional parameters. These results indicate that parameter-free patch selection can substantially improve ViT efficiency, often acting as a beneficial regularizer with minimal or positive impact on accuracy.
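The two core ideas in the abstract, parameter-free top-K patch selection and an analytic FLOPs model that scales with sequence length, can be sketched as follows. This is a minimal illustration, not the submission's implementation: the per-patch variance score stands in for the paper's (unspecified here) saliency proxy, and the FLOPs formula counts encoder multiply-accumulates for a ViT-B-like configuration, ignoring the patch embedding and classification head.

```python
import numpy as np

def saliency_topk(patches, keep_ratio=0.7):
    """Parameter-free token pruning sketch.
    patches: (B, N, D) array of flattened patch pixels.
    Assumption: per-patch pixel variance is used as the saliency proxy
    (high variance ~ edges/texture, low variance ~ flat background)."""
    B, N, D = patches.shape
    scores = patches.var(axis=-1)                    # (B, N) saliency per patch
    K = max(1, int(round(keep_ratio * N)))           # token budget
    idx = np.argsort(-scores, axis=-1)[:, :K]        # top-K patch indices per image
    kept = np.take_along_axis(patches, idx[..., None], axis=1)
    return kept, idx                                 # pruned tokens + their positions

def vit_flops(seq_len, depth=12, dim=768, mlp_ratio=4):
    """Analytic encoder FLOPs (counting multiply-accumulates) as a function
    of sequence length S: attention projections are O(S d^2), the attention
    matrix is O(S^2 d), and the MLP is O(S d^2)."""
    S, d = seq_len, dim
    attn = 4 * S * d * d + 2 * S * S * d             # QKV + output proj, QK^T + AV
    mlp = 2 * S * d * (mlp_ratio * d)                # two MLP linear layers
    return depth * (attn + mlp)
```

With these hypothetical defaults, `vit_flops(119)` lands near 10.4 GFLOPs, in the same range as the K=60%, S=119 figure quoted in the abstract; small differences would come from terms the sketch omits.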
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 24975