Keywords: efficient transformer; token merging; matrix operations; inference speed; training speed; off-the-shelf
TL;DR: We propose a training-free, plug-and-play, fully matrix-based, and end-to-end differentiable token merging method to accelerate the inference and training of Vision Transformers.
Abstract: We introduce MaMe, a training-free, differentiable token merging method that relies entirely on matrix operations to accelerate vision transformers. When applied to pre-trained models, MaMe doubles ViT-B@224 throughput with a mere 2% drop in accuracy. For training from scratch, a ViT-T model with MaMe achieves 1.94× throughput with a 1.3% accuracy drop. As a downsampling layer in Swin architectures, MaMe reduces FLOPs by 2.4× for Swin-S backbones, achieving 47.0% mIoU on ADE20K semantic segmentation. In SigLIP2-B@512 zero-shot classification, MaMe provides 1.3× acceleration with negligible performance degradation (78.02 vs. 78.37). For multimodal reasoning, MaMe accelerates LLaVA-v1.5-7B inference by 36% on MME with minimal degradation (31.40 vs. 32.76). In video tasks, MaMe accelerates VideoMAE-L by 48.5% on Kinetics-400 with a 0.84% accuracy loss. Collectively, these results demonstrate MaMe's effectiveness in accelerating transformer-based vision and multimodal models.
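The abstract does not spell out MaMe's merge construction, but the core idea it names, reducing N tokens to K purely through matrix operations so the step stays differentiable, can be sketched generically. Below is a minimal, hypothetical PyTorch illustration of matrix-based token merging: a soft assignment matrix is built from token similarities and the merge is a single batched matmul. The function name `matrix_token_merge`, the choice of the first K tokens as anchors, and the cosine-similarity softmax grouping are all illustrative assumptions, not the paper's actual algorithm.

```python
import torch
import torch.nn.functional as F

def matrix_token_merge(x: torch.Tensor, k: int) -> torch.Tensor:
    """Illustrative sketch (not MaMe's actual method): merge N tokens down to K
    using only matrix operations, keeping the step end-to-end differentiable.

    x: (B, N, D) token embeddings; returns (B, K, D) merged tokens.
    """
    # Placeholder anchor choice: the first K tokens define the K merge groups.
    anchors = x[:, :k, :]                                                # (B, K, D)

    # Cosine similarity between every token and every anchor.
    sim = F.normalize(x, dim=-1) @ F.normalize(anchors, dim=-1).transpose(1, 2)  # (B, N, K)

    # Soft assignment matrix: each token distributes its mass over K groups.
    assign = sim.softmax(dim=-1)                                         # (B, N, K)

    # Column-normalize so each merged token is a weighted mean of its group.
    weights = assign / assign.sum(dim=1, keepdim=True).clamp_min(1e-6)   # (B, N, K)

    # The merge itself is one batched matmul -- fully matrix-based.
    return weights.transpose(1, 2) @ x                                   # (B, K, D)

# Example: halve the 196 patch tokens of a ViT-B@224 to 98.
tokens = torch.randn(2, 196, 768)
print(matrix_token_merge(tokens, 98).shape)  # torch.Size([2, 98, 768])
```

Because everything reduces to matmuls and softmaxes, a layer like this is trivially GPU-friendly and admits gradients, which is consistent with the abstract's claims of being both plug-and-play at inference and usable when training from scratch.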
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 19150