Keywords: efficient transformer; token merging; matrix operations; inference speed; training speed; off-the-shelf
TL;DR: We propose a training-free, plug-and-play, fully matrix-based, and end-to-end differentiable token merging method to accelerate the inference and training of Vision Transformers.
Abstract: We introduce MaMe, a training-free, differentiable token merging method that relies entirely on matrix operations to accelerate vision transformers. When applied to pre-trained models, MaMe doubles ViT-B@224 throughput with a mere 2% drop in accuracy. For training from scratch, a ViT-T model with MaMe achieves 1.94× throughput with a 1.3% accuracy drop. As a downsampling layer in Swin architectures, MaMe reduces FLOPs by 2.4× for Swin-S backbones, achieving 47.0% mIoU on ADE20K semantic segmentation. In SigLIP2-B@512 zero-shot classification, MaMe provides 1.3× acceleration with negligible performance degradation (78.02 vs. 78.37). For multimodal reasoning, MaMe accelerates LLaVA-v1.5-7B inference by 36% on MME with minimal degradation (31.40 vs. 32.76). In video tasks, MaMe accelerates VideoMAE-L by 48.5% on Kinetics-400 with a 0.84% accuracy loss. Collectively, these results demonstrate MaMe's effectiveness in accelerating transformer-based vision and multimodal models.
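The abstract does not spell out MaMe's merge construction, but the core idea it names, reducing N tokens to K purely through matrix operations so the step stays differentiable, can be sketched generically. Below is a minimal, hypothetical PyTorch illustration of matrix-based token merging: a soft assignment matrix is built from token similarities and the merge is a single batched matmul. The function name `matrix_token_merge`, the choice of the first K tokens as anchors, and the cosine-similarity softmax grouping are all illustrative assumptions, not the paper's actual algorithm.

```python
import torch
import torch.nn.functional as F

def matrix_token_merge(x: torch.Tensor, k: int) -> torch.Tensor:
    """Illustrative sketch (not MaMe's actual method): merge N tokens down to K
    using only matrix operations, keeping the step end-to-end differentiable.

    x: (B, N, D) token embeddings; returns (B, K, D) merged tokens.
    """
    # Placeholder anchor choice: the first K tokens define the K merge groups.
    anchors = x[:, :k, :]                                                # (B, K, D)

    # Cosine similarity between every token and every anchor.
    sim = F.normalize(x, dim=-1) @ F.normalize(anchors, dim=-1).transpose(1, 2)  # (B, N, K)

    # Soft assignment matrix: each token distributes its mass over K groups.
    assign = sim.softmax(dim=-1)                                         # (B, N, K)

    # Column-normalize so each merged token is a weighted mean of its group.
    weights = assign / assign.sum(dim=1, keepdim=True).clamp_min(1e-6)   # (B, N, K)

    # The merge itself is one batched matmul -- fully matrix-based.
    return weights.transpose(1, 2) @ x                                   # (B, K, D)

# Example: halve the 196 patch tokens of a ViT-B@224 to 98.
tokens = torch.randn(2, 196, 768)
print(matrix_token_merge(tokens, 98).shape)  # torch.Size([2, 98, 768])
```

Because everything reduces to matmuls and softmaxes, a layer like this is trivially GPU-friendly and admits gradients, which is consistent with the abstract's claims of being both plug-and-play at inference and usable when training from scratch.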
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 19150