Keywords: Multimodal Large Language Models, Vision Encoder, Vision Token Compression
Abstract: Existing visual token compression methods for Multimodal Large Language Models (MLLMs) predominantly operate as post-encoder modules, limiting their potential for efficiency gains.
To address this limitation, we propose LaCo (Layer-wise Visual Token Compression), a novel framework that compresses tokens directly within the intermediate layers of the vision encoder. LaCo introduces two core components: 1) a layer-wise pixel-shuffle mechanism that systematically merges adjacent tokens through space-to-channel transformations, and 2) a residual learning architecture with non-parametric shortcuts that preserves critical visual information during compression. Extensive experiments show that LaCo outperforms all existing methods when compressing tokens in the vision encoder's intermediate layers. Compared to external compression, it also improves training efficiency by more than 20\% and inference throughput by over 15\% while maintaining strong performance.
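To make the two components concrete, the following is a minimal PyTorch sketch of one layer-wise compression step as we read the abstract; the module name, the 2x2 merge ratio, the linear projection, and the mean-pooled shortcut are illustrative assumptions, not the authors' released code.

```python
import torch
import torch.nn as nn

class LayerwiseTokenCompressor(nn.Module):
    """Hypothetical sketch: pixel-shuffle-style token merging with a
    non-parametric residual shortcut, inserted at an intermediate
    vision-encoder layer. All names and details are assumptions."""

    def __init__(self, dim: int, ratio: int = 2):
        super().__init__()
        self.ratio = ratio
        # Project the channel-expanded merged tokens back to the model width.
        self.proj = nn.Linear(dim * ratio * ratio, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, N, C) visual tokens on a square H x W grid, N = H * W.
        b, n, c = x.shape
        h = w = int(n ** 0.5)
        r = self.ratio
        assert h * w == n and h % r == 0, "expects a square grid divisible by the ratio"
        grid = x.view(b, h, w, c)
        # Space-to-channel: fold each r x r spatial neighborhood into channels,
        # merging r*r adjacent tokens into one token of width r*r*C.
        blocks = grid.view(b, h // r, r, w // r, r, c)
        merged = (blocks.permute(0, 1, 3, 2, 4, 5)
                        .reshape(b, (h // r) * (w // r), r * r * c))
        out = self.proj(merged)
        # Non-parametric shortcut: mean-pool the same neighborhoods and add
        # them back, so compression starts near an identity over averages.
        shortcut = blocks.mean(dim=(2, 4)).reshape(b, (h // r) * (w // r), c)
        return out + shortcut
```

Applied at an intermediate encoder layer, a step like this cuts the token count by a factor of r^2 (4 here), which is where the abstract's training and inference savings would come from.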
Primary Area: foundation or frontier models, including LLMs
Submission Number: 19989