Modality-Aware Quantization: Balancing Visual and Textual Fidelity in Multimodal Compression

ICLR 2026 Conference Submission23971 Authors

20 Sept 2025 (modified: 08 Oct 2025), ICLR 2026 Conference Submission, CC BY 4.0
Keywords: Multimodal Model Compression
Abstract: Vision-language models (VLMs) have achieved remarkable capabilities across multimodal tasks, yet their deployment remains constrained by substantial computational requirements. While post-training quantization (PTQ) offers a practical solution for model compression, existing methods fail to address a fundamental challenge in VLM quantization: the inherent heterogeneity between visual and textual representations. In this work, we identify and formalize a critical failure mode in which visual tokens, despite their lower semantic density, dominate the quantization optimization process due to their extreme value distributions and numerical prevalence. This dominance systematically degrades the preservation of semantically critical language tokens, severely impacting model performance. We present a theoretically grounded framework that establishes this trade-off through formal analysis and introduce an adaptive optimization pipeline that dynamically balances cross-modal heterogeneity. Our method leverages activation-scale statistics and gradient-sensitivity priors to construct layer-wise modality weights that counteract visual dominance while preserving linguistic fidelity, all without altering the inference computation graph. Extensive experiments demonstrate state-of-the-art performance across diverse quantization regimes: on Qwen-VL-Chat with W4A8 quantization, we achieve 59.27% on TextVQA, surpassing the previous best method, MQuant, by +2.70%. Most notably, under extreme W4A4 quantization, where existing approaches fail catastrophically, our method maintains robust performance (55.72% on TextVQA), showing that aggressive multimodal compression is both achievable and practical for real-world deployment. The code is available at this anonymous link: https://anonymous.4open.science/status/MAQ-23971iclr
Primary Area: foundation or frontier models, including LLMs
Submission Number: 23971
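
To make the idea described in the abstract concrete, below is a minimal sketch, assuming PyTorch and an MSE reconstruction objective during PTQ calibration, of how layer-wise modality weights could be built from activation-scale statistics and gradient-sensitivity priors and applied to a modality-weighted calibration loss. All function names and the exact weighting formula are illustrative assumptions, not the submission's actual implementation.

```python
# Hypothetical sketch of layer-wise modality weighting for PTQ calibration.
# The weighting formula is an assumption for illustration only.
import torch

def modality_weights(acts, grads, vis_mask, eps=1e-6):
    """Compute per-modality loss weights for one linear layer.

    acts:     (T, C) calibration activations for T tokens
    grads:    (T, C) gradients of the task loss w.r.t. those activations
    vis_mask: (T,) boolean mask, True for visual tokens
    Returns (w_vis, w_txt): scalar weights applied to each modality's
    reconstruction error so that visual tokens, which are more numerous
    and have larger activation scales, do not dominate calibration.
    """
    # Activation-scale statistic: mean absolute activation per modality.
    s_vis = acts[vis_mask].abs().mean()
    s_txt = acts[~vis_mask].abs().mean()

    # Gradient-sensitivity prior: mean gradient magnitude per modality.
    g_vis = grads[vis_mask].abs().mean()
    g_txt = grads[~vis_mask].abs().mean()

    # Down-weight the large-scale modality, up-weight the more
    # gradient-sensitive one; normalize so the weights average to 1.
    w_vis = g_vis / (s_vis + eps)
    w_txt = g_txt / (s_txt + eps)
    z = (w_vis + w_txt) / 2 + eps
    return w_vis / z, w_txt / z

def weighted_recon_loss(y_fp, y_q, vis_mask, w_vis, w_txt):
    """Modality-weighted MSE between full-precision and quantized outputs."""
    err = (y_fp - y_q).pow(2).mean(dim=-1)          # (T,) per-token error
    weights = torch.where(vis_mask, w_vis, w_txt)   # (T,) per-token weight
    return (weights * err).mean()
```

In this sketch the weights rescale each token's reconstruction error during calibration only, so the quantized model's inference computation graph is unchanged, consistent with the claim in the abstract.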