Compression with Global Guidance: Towards Training-free High-Resolution MLLMs Acceleration

Published: 01 Jan 2025, Last Modified: 27 Mar 2025, CoRR 2025, CC BY-SA 4.0
Abstract: Large vision-language models (LVLMs) excel at visual understanding and reasoning, but face efficiency challenges due to the quadratic complexity of processing long multimodal contexts. Token compression techniques can reduce computational costs, but existing approaches are designed for single-view LVLMs and fail to consider the unique multi-view characteristics of recent high-resolution LVLMs (HR-LVLMs) with dynamic tiling. Whereas existing methods treat all tokens uniformly, our analysis reveals that global thumbnails can naturally guide the compression of local crops by providing holistic context for informativeness evaluation. In this paper, we first analyze the dynamic tiling strategy comprehensively, revealing both the complementary nature of thumbnails and crops and the distinctive characteristics of different crops. Based on these observations, we propose "Global Compression Commander" (i.e., GlobalCom$^2$), a novel plug-and-play token compression framework for HR-LVLMs. GlobalCom$^2$ leverages the thumbnail as the "commander" to guide the compression process of local crops, adaptively preserving informative details while eliminating redundancy. Extensive experiments show that GlobalCom$^2$ maintains over 90\% performance while compressing 90\% of visual tokens, reducing FLOPs and peak memory to 9.1\% and 60\%, respectively, across multiple benchmarks. Our code is available at https://github.com/xuyang-liu16/GlobalCom2.
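To make the idea of thumbnail-guided compression concrete, the following is a minimal sketch in plain Python. It is not the authors' implementation: it assumes that the thumbnail yields one importance score per crop region and that each crop token already has an informativeness score. The function names (`allocate_budgets`, `keep_top_k`, `compress_crops`) and the proportional allocation rule are hypothetical choices made only for illustration.

```python
from typing import List


def allocate_budgets(region_scores: List[float], total_budget: int) -> List[int]:
    """Split a token budget across crops in proportion to per-crop scores.

    Assumption: region_scores[i] is an importance score for crop i that was
    derived from the global thumbnail (how it is derived is out of scope here).
    """
    total = sum(region_scores)
    if total <= 0:
        # No usable signal: fall back to a uniform split.
        shares = [1.0 / len(region_scores)] * len(region_scores)
    else:
        shares = [s / total for s in region_scores]
    budgets = [int(total_budget * w) for w in shares]
    # Hand leftover tokens (lost to truncation) to the highest-share crops.
    remainder = total_budget - sum(budgets)
    order = sorted(range(len(shares)), key=lambda i: shares[i], reverse=True)
    for i in order[:remainder]:
        budgets[i] += 1
    return budgets


def keep_top_k(token_scores: List[float], k: int) -> List[int]:
    """Return indices of the k highest-scoring tokens, in original order."""
    k = max(0, min(k, len(token_scores)))
    ranked = sorted(range(len(token_scores)),
                    key=lambda i: token_scores[i], reverse=True)
    return sorted(ranked[:k])


def compress_crops(region_scores: List[float],
                   crop_token_scores: List[List[float]],
                   keep_ratio: float) -> List[List[int]]:
    """Choose which token indices to keep in each crop.

    A crop's budget is clamped to its token count, so the number of kept
    tokens can fall slightly below the overall target.
    """
    total_tokens = sum(len(scores) for scores in crop_token_scores)
    total_budget = round(total_tokens * keep_ratio)
    budgets = allocate_budgets(region_scores, total_budget)
    return [keep_top_k(scores, b)
            for scores, b in zip(crop_token_scores, budgets)]
```

In this sketch, crops that the thumbnail marks as more important retain more tokens, while less important crops are compressed more aggressively. A keep ratio of 0.1 would correspond to the 90\% compression setting mentioned in the abstract.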