Modality-Aware Quantization: Balancing Visual and Textual Fidelity in Multimodal Compression

ICLR 2026 Conference Submission23971 Authors

20 Sept 2025 (modified: 08 Oct 2025), ICLR 2026 Conference Submission, CC BY 4.0
Keywords: Multimodal Model Compression
Abstract: Vision-language models (VLMs) have achieved remarkable capabilities across multimodal tasks, yet their deployment remains constrained by substantial computational requirements. While post-training quantization (PTQ) offers a practical solution for model compression, existing methods fail to address a fundamental challenge in VLM quantization: the inherent heterogeneity between visual and textual representations. In this work, we identify and formalize a critical failure mode in which visual tokens, despite their lower semantic density, dominate the quantization optimization process due to their extreme value distributions and numerical prevalence. This dominance systematically degrades the preservation of semantically critical language tokens, severely impacting model performance. We present a theoretically grounded framework that establishes this trade-off through formal analysis and introduce an adaptive optimization pipeline that dynamically balances cross-modal heterogeneity. Our method leverages activation-scale statistics and gradient-sensitivity priors to construct layer-wise modality weights that counteract visual dominance while preserving linguistic fidelity, all without altering the inference computation graph. Extensive experiments demonstrate state-of-the-art performance across diverse quantization regimes: on Qwen-VL-Chat with W4A8 quantization, we achieve 59.27% on TextVQA, surpassing the previous best method, MQuant, by +2.70%. Most notably, under extreme W4A4 quantization, where existing approaches fail catastrophically, our method maintains robust performance (55.72% on TextVQA), showing that aggressive multimodal compression is both achievable and practical for real-world deployment. The code is available at this anonymous link: https://anonymous.4open.science/status/MAQ-23971iclr
Primary Area: foundation or frontier models, including LLMs
Submission Number: 23971
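
To make the idea described in the abstract concrete, below is a minimal sketch, assuming PyTorch and an MSE reconstruction objective during PTQ calibration, of how layer-wise modality weights could be built from activation-scale statistics and gradient-sensitivity priors and applied to a modality-weighted calibration loss. All function names and the exact weighting formula are illustrative assumptions, not the submission's actual implementation.

```python
# Hypothetical sketch of layer-wise modality weighting for PTQ calibration.
# The weighting formula is an assumption for illustration only.
import torch

def modality_weights(acts, grads, vis_mask, eps=1e-6):
    """Compute per-modality loss weights for one linear layer.

    acts:     (T, C) calibration activations for T tokens
    grads:    (T, C) gradients of the task loss w.r.t. those activations
    vis_mask: (T,) boolean mask, True for visual tokens
    Returns (w_vis, w_txt): scalar weights applied to each modality's
    reconstruction error so that visual tokens, which are more numerous
    and have larger activation scales, do not dominate calibration.
    """
    # Activation-scale statistic: mean absolute activation per modality.
    s_vis = acts[vis_mask].abs().mean()
    s_txt = acts[~vis_mask].abs().mean()

    # Gradient-sensitivity prior: mean gradient magnitude per modality.
    g_vis = grads[vis_mask].abs().mean()
    g_txt = grads[~vis_mask].abs().mean()

    # Down-weight the large-scale modality, up-weight the more
    # gradient-sensitive one; normalize so the weights average to 1.
    w_vis = g_vis / (s_vis + eps)
    w_txt = g_txt / (s_txt + eps)
    z = (w_vis + w_txt) / 2 + eps
    return w_vis / z, w_txt / z

def weighted_recon_loss(y_fp, y_q, vis_mask, w_vis, w_txt):
    """Modality-weighted MSE between full-precision and quantized outputs."""
    err = (y_fp - y_q).pow(2).mean(dim=-1)          # (T,) per-token error
    weights = torch.where(vis_mask, w_vis, w_txt)   # (T,) per-token weight
    return (weights * err).mean()
```

In this sketch the weights rescale each token's reconstruction error during calibration only, so the quantized model's inference computation graph is unchanged, consistent with the claim in the abstract.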