Vision-centric Token Compression in Large Language Model

Published: 18 Sept 2025, Last Modified: 11 Dec 2025 · NeurIPS 2025 spotlight · CC BY 4.0
Keywords: Token Compression, Long Context LLMs, Large Language Model, Visual-Text, Vision-centric
TL;DR: We present a vision-centric token compression method for LLMs, inspired by the human selective reading strategy.
Abstract: Real-world applications are stretching context windows to hundreds of thousands of tokens while Large Language Models (LLMs) swell from billions to trillions of parameters. This dual expansion sends compute and memory costs skyrocketing, making $\textit{token compression}$ indispensable. We introduce Vision Centric Token Compression ($\textbf{Vist}$), a $\textit{slow–fast}$ compression framework that mirrors human reading: the $\textit{fast}$ path renders distant tokens into images, letting a $\textbf{frozen, lightweight vision encoder}$ skim the low-salience context; the $\textit{slow}$ path feeds the proximal window into the LLM for fine-grained reasoning. A Probability-Informed Visual Enhancement (PVE) objective masks high-frequency tokens during training, steering the Resampler to concentrate on semantically rich regions, just as skilled readers gloss over function words. On eleven in-context learning benchmarks, $\textbf{Vist}$ achieves the same accuracy with 2.3$\times$ fewer tokens, cutting FLOPs by 16\% and memory by 50\%. It outperforms the strongest text encoder-based compression method, CEPE, by $\textbf{7.6}$\% on average across benchmarks such as TriviaQA, NQ, PopQA, NLUI, and CLIN, setting a new standard for token efficiency in LLMs. The project is available at https://github.com/CSU-JPG/VIST.
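The sketch below is a minimal, self-contained illustration of the slow–fast pipeline as described in the abstract (distant context rendered to images, skimmed by a frozen vision encoder, compressed by a resampler, then concatenated with the proximal window). All module names, sizes, and the toy "renderer" are illustrative assumptions, not the released VIST implementation.

```python
# Conceptual sketch only: toy stand-ins, not the authors' code.
import torch
import torch.nn as nn

D, K = 256, 32                                  # feature dim, compressed visual tokens

class ToyRenderer(nn.Module):
    """Stand-in for rendering distant text tokens into an image tensor."""
    def forward(self, token_ids):               # (B, L) long
        return torch.randn(token_ids.size(0), 3, 224, 224)   # pretend-rendered pages

class FrozenVisionEncoder(nn.Module):
    """Lightweight patch encoder; kept frozen (the 'fast' skimming path)."""
    def __init__(self):
        super().__init__()
        self.proj = nn.Conv2d(3, D, kernel_size=16, stride=16)
    def forward(self, imgs):                     # (B, 3, 224, 224)
        x = self.proj(imgs)                      # (B, D, 14, 14)
        return x.flatten(2).transpose(1, 2)      # (B, 196, D) patch features

class Resampler(nn.Module):
    """Learned queries cross-attend to patch features -> K visual tokens."""
    def __init__(self):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(K, D))
        self.attn = nn.MultiheadAttention(D, num_heads=4, batch_first=True)
    def forward(self, patch_feats):              # (B, P, D)
        q = self.queries.expand(patch_feats.size(0), -1, -1)
        out, _ = self.attn(q, patch_feats, patch_feats)
        return out                               # (B, K, D)

vision_encoder = FrozenVisionEncoder().requires_grad_(False)
resampler, renderer = Resampler(), ToyRenderer()

distant_ids = torch.randint(0, 32000, (2, 4096))     # long-range ("distant") context
proximal_embeds = torch.randn(2, 512, D)             # proximal-window embeddings

with torch.no_grad():
    patches = vision_encoder(renderer(distant_ids))  # fast path: skim rendered text
visual_tokens = resampler(patches)                   # compress 196 patches -> K tokens
llm_inputs = torch.cat([visual_tokens, proximal_embeds], dim=1)  # slow path input
print(llm_inputs.shape)                              # torch.Size([2, 544, 256])
```

In this sketch only the resampler would be trained (e.g., under a PVE-style masked objective); the vision encoder stays frozen, which is the property the abstract highlights for keeping the fast path cheap.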
Primary Area: Deep learning (e.g., architectures, generative models, optimization for deep networks, foundation models, LLMs)
Submission Number: 5467