PREP: Pre-inference Guided Token Pruning for Efficient Vision-Language Models

ICLR 2026 Conference Submission 16446 Authors

19 Sept 2025 (modified: 08 Oct 2025) · ICLR 2026 Conference Submission · CC BY 4.0
Keywords: Vision-Language Models (VLMs), Training-Free Pruning, Information Bottleneck, Inference Acceleration
Abstract: Recent Vision-Language Models (VLMs) have demonstrated strong fine-grained perception capabilities across a wide range of Visual Question Answering (VQA) tasks. However, this advantage comes at the cost of a rapidly increasing number of visual tokens, leading to substantial computational and memory overhead. Existing training-free methods adopt fixed-layer or layer-by-layer pruning, which disrupts modality fusion before alignment and leads to significant performance degradation under high pruning ratios. In this study, we observe that after the early stage of modality fusion, cross-modal attention not only accurately identifies regions of interest but also becomes less sensitive to pruning. Building on this observation, we propose \textbf{PREP}, a training-free method that identifies the optimal pruning layer via patch-level pre-inference, thereby avoiding the loss of fine-grained details caused by stepwise pruning. Specifically, PREP identifies the layer with accurate cross-modal alignment using an \textbf{E}ntropy--\textbf{KL} divergence (EKL) score derived from the Information Bottleneck principle, and then, during full inference, retains at this layer the tokens that are critical for visual integrity and semantic alignment. Experiments on LLaVA-1.5-7B show that, using only \textbf{9} visual tokens and half of the layers in pre-inference, PREP preserves \textbf{96.2\%} of the original performance while retaining just \textbf{16} visual tokens (\textbf{3\%}) in full inference, reducing KV-cache usage by \textbf{67\%} and accelerating inference by \textbf{1.66$\times$}. Our code is provided in the supplementary materials.
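The abstract describes selecting a pruning layer from a cheap patch-level pre-inference pass via an Entropy--KL (EKL) score and then keeping only the most aligned visual tokens at that layer. Below is a minimal sketch of one way such a selection could look; the function name, the specific entropy-plus-KL combination, and the use of raw cross-modal attention mass over visual tokens are illustrative assumptions, not the paper's exact formulation.

```python
import torch

def select_pruning_layer_and_tokens(attn_maps, keep_k=16):
    """Hypothetical sketch of PREP-style layer and token selection.

    attn_maps: list of 1-D tensors, one per pre-inference layer, each of shape
               [num_visual_tokens], holding the cross-modal attention mass that
               each visual token receives from the text query at that layer.
    keep_k:    number of visual tokens to retain at the chosen layer.
    """
    eps = 1e-8
    scores = []
    for l, p in enumerate(attn_maps):
        p = p / (p.sum() + eps)                       # attention distribution over visual tokens
        entropy = -(p * (p + eps).log()).sum()        # low entropy: attention is focused on few regions
        if l == 0:
            kl = torch.tensor(0.0)
        else:
            q = attn_maps[l - 1]
            q = q / (q.sum() + eps)
            kl = (p * ((p + eps) / (q + eps)).log()).sum()  # small KL: alignment has stabilized vs. previous layer
        # Assumed combination of the two terms; the paper's exact EKL score may differ.
        scores.append((entropy + kl).item())

    best_layer = min(range(len(scores)), key=scores.__getitem__)
    p_best = attn_maps[best_layer]
    keep_idx = torch.topk(p_best, k=min(keep_k, p_best.numel())).indices
    return best_layer, keep_idx
```

In this sketch the layer with the lowest combined score is treated as the point where cross-modal alignment is both concentrated and stable, and only the top-`keep_k` visual tokens by attention mass are kept there for the full inference pass; how PREP actually scores and retains tokens is specified in the paper itself.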
Supplementary Material: zip
Primary Area: generative models
Submission Number: 16446