Keywords: Multimodal Large Language Models, Vision Encoder, Vision Token Compression
Abstract: Existing visual token compression methods for Multimodal Large Language Models (MLLMs) predominantly operate as post-encoder modules, limiting their potential for efficiency gains.
To address this limitation, we propose LaCo (Layer-wise Visual Token Compression), a novel framework that compresses tokens directly within the intermediate layers of the vision encoder. LaCo introduces two core components: 1) a layer-wise pixel-shuffle mechanism that systematically merges adjacent tokens through space-to-channel transformations, and 2) a residual learning architecture with non-parametric shortcuts that preserves critical visual information during compression. Extensive experiments show that LaCo outperforms all existing methods when compressing tokens in the vision encoder's intermediate layers. Compared to external compression, it also improves training efficiency by more than 20\% and inference throughput by over 15\% while maintaining strong performance.
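To make the two components concrete, the following is a minimal PyTorch sketch of one layer-wise compression step as we read the abstract; the module name, the 2x2 merge ratio, the linear projection, and the mean-pooled shortcut are illustrative assumptions, not the authors' released code.

```python
import torch
import torch.nn as nn

class LayerwiseTokenCompressor(nn.Module):
    """Hypothetical sketch: pixel-shuffle-style token merging with a
    non-parametric residual shortcut, inserted at an intermediate
    vision-encoder layer. All names and details are assumptions."""

    def __init__(self, dim: int, ratio: int = 2):
        super().__init__()
        self.ratio = ratio
        # Project the channel-expanded merged tokens back to the model width.
        self.proj = nn.Linear(dim * ratio * ratio, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, N, C) visual tokens on a square H x W grid, N = H * W.
        b, n, c = x.shape
        h = w = int(n ** 0.5)
        r = self.ratio
        assert h * w == n and h % r == 0, "expects a square grid divisible by the ratio"
        grid = x.view(b, h, w, c)
        # Space-to-channel: fold each r x r spatial neighborhood into channels,
        # merging r*r adjacent tokens into one token of width r*r*C.
        blocks = grid.view(b, h // r, r, w // r, r, c)
        merged = (blocks.permute(0, 1, 3, 2, 4, 5)
                        .reshape(b, (h // r) * (w // r), r * r * c))
        out = self.proj(merged)
        # Non-parametric shortcut: mean-pool the same neighborhoods and add
        # them back, so compression starts near an identity over averages.
        shortcut = blocks.mean(dim=(2, 4)).reshape(b, (h // r) * (w // r), c)
        return out + shortcut
```

Applied at an intermediate encoder layer, a step like this cuts the token count by a factor of r^2 (4 here), which is where the abstract's training and inference savings would come from.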
Primary Area: foundation or frontier models, including LLMs
Submission Number: 19989