Compress & Cache: Vision token compression for efficient generation and retrieval

Adrian Bulat; Yassine Ouali; Georgios Tzimiropoulos

Published: 18 Sept 2025, Last Modified: 29 Oct 2025

Venue: NeurIPS 2025 poster

License: CC BY 4.0

Keywords: token compression, llava

Abstract: This work aims to compress the vision tokens of an LVLM into a representation that is simultaneously (a) suitable for generative tasks, (b) suitable for discriminative tasks, (c) nearly lossless, and (d) storage-efficient. To this end, we propose C&C, a novel compression method that leverages the LVLM itself for task-agnostic visual token compression. Unlike prior methods that perform token reduction on-the-fly, our approach offloads computation to a dedicated, upfront indexing stage, effectively decoupling compression from generation. This enables learning more powerful representations for generation during inference. At the core of C&C is a "double-forward pass" training strategy. During the first forward pass, the LLM (of the LVLM) creates a bottleneck by compressing the dense visual tokens into a few summary tokens. The second forward pass then processes the language instruction(s) alongside the summary tokens, which directly replace the image tokens. The training of C&C is guided by two key losses: an autoregressive loss, applied after the second pass, that provides a direct optimization objective for reconstructing the original information flow; and a contrastive loss, applied after the first pass, that bolsters the representational strength of the summary tokens, particularly for discriminative tasks. Moreover, we propose stage-specific adapters to further enhance performance. C&C produces highly informative compressed representations, and an in-depth ablation study confirms the efficacy of our approach. For generative tasks, we achieve a 2x higher compression rate without compromising capabilities, setting a new state-of-the-art. For discriminative tasks, we establish new state-of-the-art results on image retrieval and compositionality benchmarks.
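The data flow described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: `compress`, `SummaryCache`, `info_nce`, and `total_loss` are hypothetical names, chunk averaging is only a stand-in for the first forward pass through the LLM, the contrastive term is a generic one-directional InfoNCE loss, and the loss weight is an assumed hyperparameter. The sketch shows only the structure: compress once at indexing time, cache the summary tokens, and reuse them in place of the image tokens when an instruction arrives.

```python
import math


def compress(visual_tokens, num_summary):
    """Placeholder for the first forward pass: many visual tokens -> at most
    num_summary summary tokens. Chunk averaging is an assumption made for
    illustration; in the abstract the LLM itself performs this step."""
    n, dim = len(visual_tokens), len(visual_tokens[0])
    chunk = math.ceil(n / num_summary)
    summary = []
    for start in range(0, n, chunk):
        group = visual_tokens[start:start + chunk]
        summary.append([sum(t[d] for t in group) / len(group) for d in range(dim)])
    return summary


class SummaryCache:
    """Hypothetical index: compression runs once, upfront, per image."""

    def __init__(self, num_summary):
        self.num_summary = num_summary
        self._store = {}

    def index(self, image_id, visual_tokens):
        self._store[image_id] = compress(visual_tokens, self.num_summary)

    def build_prompt(self, image_id, instruction_tokens):
        # Input to the second pass: cached summary tokens replace the image tokens.
        return self._store[image_id] + instruction_tokens


def _normalize(v):
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]


def info_nce(image_embs, text_embs, temperature=0.07):
    """Generic image-to-text InfoNCE loss; row i of each list is a matched pair."""
    imgs = [_normalize(v) for v in image_embs]
    txts = [_normalize(v) for v in text_embs]
    total = 0.0
    for i, img in enumerate(imgs):
        logits = [sum(a * b for a, b in zip(img, txt)) / temperature for txt in txts]
        peak = max(logits)  # subtract the max for a numerically stable log-sum-exp
        log_sum = peak + math.log(sum(math.exp(l - peak) for l in logits))
        total += log_sum - logits[i]
    return total / len(imgs)


def total_loss(autoregressive_loss, contrastive_loss, weight=1.0):
    # Assumed combination: a weighted sum of the two training losses.
    return autoregressive_loss + weight * contrastive_loss


cache = SummaryCache(num_summary=2)
cache.index("img-0", [[float(i), 1.0] for i in range(8)])
prompt = cache.build_prompt("img-0", [[0.0, 0.0]] * 3)
print(len(prompt))  # 2 summary tokens followed by 3 instruction tokens
```

Because the summary tokens are computed and stored ahead of time, the per-query cost in this sketch depends only on the number of summary tokens and the instruction length, not on the original number of visual tokens.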

Primary Area: Applications (e.g., vision, language, speech and audio, Creative AI)

Submission Number: 13614
