Similarity-Aware Token Pruning: Your VLM but Faster

Published: 15 Mar 2025 · Last Modified: 07 Oct 2025 · OpenReview Archive Direct Upload · CC BY 4.0
Abstract: Vision Transformers (ViTs) and Vision–Language Models (VLMs) scale poorly with input length: self-attention is quadratic in token count and feed-forward memory is linear. Token pruning can curb these costs, yet most approaches either require extra training or rely on rigid, sample-agnostic schedules, and none transfers cleanly between ViTs and VLMs. We begin by examining two common pruning signals, attention weights and token similarity, across multiple architectures and observe that token similarity consistently outperforms attention weights as a pruning criterion. We then show that dropping similar tokens outperforms the prevailing strategy of merging them when accuracy–latency trade-offs are measured end-to-end. Building on this insight, we introduce SAINT, a training-free, graph-based framework that ranks tokens by pairwise similarity and adjusts both pruning depth and rate on the fly. SAINT aggressively removes early-layer tokens, where latency gains are largest, while preserving task-critical information. Experiments show that SAINT doubles the throughput of ViT-H/14 @224 px with only a 0.6% top-1 drop on ImageNet-1K, beating the strongest baseline by 0.8%. Applied to LLaVA-13B, SAINT prunes 75% of tokens and delivers the latency of LLaVA-7B with <1% performance loss across standard VLM benchmarks. SAINT therefore offers a unified plug-and-play path to efficient inference for both visual and multimodal transformers.
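To make the core idea concrete, the sketch below illustrates similarity-based token dropping in the spirit the abstract describes: rank patch tokens by how redundant they are under pairwise cosine similarity and drop the most redundant ones, keeping the CLS token. It is a minimal illustration, not the paper's SAINT algorithm; the function name, the fixed `keep_ratio`, and the max-similarity redundancy score are assumptions, whereas SAINT uses a graph-based ranking with a pruning depth and rate chosen per sample.

```python
import torch
import torch.nn.functional as F


def similarity_based_prune(tokens: torch.Tensor, keep_ratio: float = 0.5) -> torch.Tensor:
    """Illustrative similarity-based token dropping (not the exact SAINT procedure).

    tokens:     (B, N, D) embeddings from a transformer layer, CLS token at index 0.
    keep_ratio: fraction of patch tokens to retain at this layer.
    Returns a tensor of shape (B, 1 + k, D) with the least redundant patch tokens.
    """
    cls_tok, patches = tokens[:, :1], tokens[:, 1:]           # always keep the CLS token
    x = F.normalize(patches, dim=-1)                          # unit-norm for cosine similarity
    sim = x @ x.transpose(1, 2)                               # (B, N-1, N-1) pairwise similarity
    sim.diagonal(dim1=1, dim2=2).fill_(-1.0)                  # ignore self-similarity
    redundancy = sim.max(dim=-1).values                       # high = closely duplicated by another token
    k = max(1, int(keep_ratio * patches.shape[1]))
    keep_idx = redundancy.topk(k, largest=False).indices      # keep the least redundant tokens
    keep_idx, _ = keep_idx.sort(dim=-1)                       # preserve original spatial order
    kept = patches.gather(1, keep_idx.unsqueeze(-1).expand(-1, -1, patches.shape[-1]))
    return torch.cat([cls_tok, kept], dim=1)


# Example: drop half of the 256 patch tokens of a ViT-style sequence.
x = torch.randn(2, 257, 1024)                                 # (batch, 1 + 16*16 patches, hidden dim)
print(similarity_based_prune(x, keep_ratio=0.5).shape)        # torch.Size([2, 129, 1024])
```

Applied in early layers, where the abstract notes the latency gains are largest, such dropping shrinks the token count before most of the quadratic attention cost is paid.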