CubistMerge: Spatial-Preserving Token Merging For Diverse ViT Backbones

20 Sept 2025 (modified: 11 Feb 2026), Submitted to ICLR 2026, CC BY 4.0
Keywords: token merging, token reduction, vision transformer, off-the-shelf
TL;DR: A token merging method that preserves 2D spatial structure, making it compatible with ViTs that use spatial architectures, such as SAM and DINOv3
Abstract: Many modern ViT backbones have adopted spatial architectural designs, such as window attention in Swin, decomposed relative positional embeddings in SAM, and RoPE in DINOv3. While token reduction has been a successful research direction for reducing the computational cost of ViTs, the vast majority of existing methods fail to preserve the structured spatial layouts that these architectures fundamentally depend on, which makes them incompatible with such backbones. In this paper, we introduce a simple yet effective token merging method that maintains spatial layouts, enabling seamless compatibility with spatial architectures. We show how to reconcile two seemingly conflicting requirements: exploiting the uneven distribution of information across the spatial layout while preserving the spatial structure of the merged tokens. Our approach employs (1) a 2D token reduction strategy that ensures structured 2D layouts in the resulting tokens, (2) a spatial-aware merging algorithm that selectively merges redundant tokens while preserving the relative spatial relationships of the tokens, and (3) a novel max-magnitude-element token representation that preserves salient features. Our method demonstrates strong performance both off-the-shelf and with fine-tuning, achieving state-of-the-art results on spatial and non-spatial architectures across various vision tasks. Specifically, we achieve a 1.25× speedup on SAM-H with only a 0.7 mIoU drop on COCO off-the-shelf, and a 1.15× speedup on DeiT-B with no accuracy drop on ImageNet after just one epoch of fine-tuning.
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 23621
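
The three components listed in the abstract can be illustrated with a minimal sketch. Everything here is an assumption made for illustration, not the paper's confirmed algorithm: the function names are hypothetical, "max-magnitude-element" is read as keeping, per channel, the value with the larger absolute value, and redundancy is measured with cosine similarity between horizontally adjacent tokens. Merging exactly one adjacent pair in every row keeps the output a rectangular H x (W-1) grid and preserves the left-to-right order of tokens.

```python
import math


def max_magnitude_merge(token_a, token_b):
    """Merge two tokens channel by channel, keeping the larger-magnitude value.

    Assumed reading of "max-magnitude-element"; ties keep the value from token_a.
    """
    if len(token_a) != len(token_b):
        raise ValueError("tokens must have the same number of channels")
    return [a if abs(a) >= abs(b) else b for a, b in zip(token_a, token_b)]


def cosine(u, v):
    """Cosine similarity of two vectors; returns 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def reduce_width_by_one(grid):
    """Hypothetical helper: shrink an H x W token grid to H x (W-1).

    In each row, the most similar pair of horizontally adjacent tokens is
    merged in place, so every row loses exactly one token and the remaining
    tokens keep their relative order.
    """
    reduced = []
    for row in grid:
        if len(row) < 2:
            raise ValueError("each row needs at least two tokens")
        # Index of the left token in the most similar adjacent pair.
        i = max(range(len(row) - 1), key=lambda j: cosine(row[j], row[j + 1]))
        merged = max_magnitude_merge(row[i], row[i + 1])
        reduced.append(row[:i] + [merged] + row[i + 2:])
    return reduced


# A toy 2 x 3 grid of 2-channel tokens.
grid = [
    [[1.0, 0.0], [1.0, 0.1], [0.0, 1.0]],
    [[0.0, 1.0], [1.0, 0.0], [2.0, -0.1]],
]
reduced = reduce_width_by_one(grid)
```

In this toy example each row merges a different pair (the first two tokens in the top row, the last two in the bottom row), yet both rows end up with two tokens, so the result can still be indexed as a 2D grid by window attention or 2D positional embeddings.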