Towards Efficient Post-Training Quantization For Large Vision-Language Models Via Token-Wise Redundancy Elimination

Yufei Xue; Yushi Huang; Jiawei Shao; Lunjie Zhu; Chi Zhang; Xuelong Li; Jun Zhang

16 Sept 2025 (modified: 11 Feb 2026), Submitted to ICLR 2026, CC BY 4.0

Keywords: Vision-Language Models (VLMs), Post-Training Quantization (PTQ), Model Compression

TL;DR: An efficient and accurate post-training quantization (PTQ) framework for large vision-language models (VLMs). The key innovation is the use of gradient-driven importance factors to eliminate token-wise redundancy, significantly improving performance.

Abstract: Post-training quantization (PTQ) has emerged as an effective technique for compressing large models and accelerating inference without retraining. While PTQ has been extensively studied in large language models (LLMs), its application to vision-language models (VLMs) remains underexplored. In this work, we identify two intrinsic characteristics of VLM activations: 1) visual over-representation, where vision tokens are excessive and often redundant, and 2) modality gap, which refers to the clear separation between text and vision tokens in the latent feature space. Together, these two factors significantly degrade quantization performance but have been overlooked by existing PTQ methods. To address these challenges, we propose VLMQ, a VLM-tailored PTQ framework that selectively prioritizes salient tokens while suppressing redundant ones during quantization. In particular, we introduce a gradient-driven importance factor to capture the variance in token-wise importance, and we substantiate its effectiveness through both empirical and theoretical analysis. To ensure efficiency, we obtain the factors via lightweight block-wise backpropagation. Finally, we reformulate the optimization objective into an importance-aware form to preserve important activation information. Extensive evaluations on 8 benchmarks across 0.5B$\sim$32B VLMs demonstrate that VLMQ achieves state-of-the-art (SOTA) performance, particularly under low-bit settings. For example, it achieves a substantial 16.45% improvement on MME-RealWorld under 2-bit quantization. Code is provided in the supplementary material.
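
As a rough illustration of what an importance-aware objective could look like, the sketch below weights a layer-wise reconstruction error by a per-token importance vector. It is a minimal sketch under stated assumptions, not the VLMQ implementation: the weighted form sum_t s_t ||(W - W_q) x_t||^2, the function names, and the round-to-nearest stand-in quantizer are all hypothetical, and the importance vector is simply passed in rather than derived from block-wise gradients.

```python
import numpy as np


def quantize_rtn(w, bits=4):
    """Symmetric per-row round-to-nearest quantizer (stand-in, not from the paper)."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(w).max(axis=1, keepdims=True) / qmax
    scale = np.where(scale == 0, 1.0, scale)  # avoid division by zero on all-zero rows
    return np.clip(np.round(w / scale), -qmax - 1, qmax) * scale


def weighted_recon_error(w, w_q, x, importance):
    """Assumed objective: sum_t s_t * ||(W - W_q) x_t||^2, tokens x_t are rows of x."""
    diff = x @ (w - w_q).T               # shape: (tokens, out_features)
    per_token = (diff ** 2).sum(axis=1)  # squared output error of each token
    return float((importance * per_token).sum())


def weighted_hessian(x, importance):
    """Importance-weighted second-moment matrix H = X^T diag(s) X."""
    return (x * importance[:, None]).T @ x


rng = np.random.default_rng(0)
w = rng.standard_normal((4, 8))   # hypothetical layer weights (out, in)
x = rng.standard_normal((16, 8))  # hypothetical calibration tokens (tokens, in)
s = rng.random(16)                # hypothetical per-token importance factors
w_q = quantize_rtn(w, bits=4)
err = weighted_recon_error(w, w_q, x, s)
```

Under this assumed form, the error equals trace((W - W_q) H (W - W_q)^T) with H = X^T diag(s) X, so tokens with a factor near zero contribute almost nothing to the statistics a Hessian-based quantizer would use, while salient tokens dominate them.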

Supplementary Material: zip

Primary Area: applications to computer vision, audio, language, and other modalities

Submission Number: 7401
