MQuant: Unleashing the Inference Potential of Multimodal Large Language Models via Full Static Quantization

Jiangyong Yu; Sifan Zhou; Dawei Yang; Shuoyu Li; Shuo Wang; Xing Hu; XUCHEN; Zukang Xu; Changyong Shu; Zhihang Yuan

25 Sept 2024 (modified: 05 Feb 2025). Submitted to ICLR 2025. License: CC BY 4.0

Keywords: Multimodal Large Language Models, Quantization

Abstract: Multimodal large language models (MLLMs) have recently garnered widespread attention for their ability to perceive and understand multimodal signals. However, their large parameter counts and substantial computational demands severely hinder practical deployment. While quantization is an effective way to reduce model size and inference latency, its application to MLLMs remains underexplored. In this paper, we conduct an in-depth analysis of MLLM quantization and identify three challenges: slow inference on visual tokens, distributional differences across modalities, and performance degradation caused by clipping visual outliers. To address these challenges, we propose **MQuant**, a quantization framework tailored for MLLMs. Specifically, (1) we design Modality-specific Quantization (MSQ) and Attention-Invariant Flexible Switching (AIFS) to support per-tensor static quantization and enable efficient inference; (2) we introduce a unified LayerNorm-to-RMSNorm transformation that integrates the MLLM vision encoder with Hadamard rotation; and (3) we propose Rotation Magnitude Suppression (RMS) to mitigate outliers introduced by Hadamard rotation. Experiments on five mainstream MLLMs demonstrate the superior performance and broad applicability of MQuant. For example, it maintains around 98% of the floating-point accuracy under the W4A8 setting (4-bit weights, 8-bit activations). To the best of our knowledge, **MQuant** is the first quantization solution for MLLMs, paving the way for future advancements in their application.
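
To make the idea of per-tensor static quantization with modality-specific scales more concrete, the following is a minimal illustrative sketch, not MQuant's implementation. It assumes symmetric INT8 quantization with max-abs calibration, and all function names (`calibrate_scale`, `quantize`, `dequantize`, `mean_abs_error`) and the toy activation values are hypothetical. It only shows why a single scale shared across modalities can hurt the modality with the smaller activation range.

```python
def calibrate_scale(samples, n_bits=8):
    """Offline step: derive one fixed per-tensor scale from calibration rows."""
    qmax = 2 ** (n_bits - 1) - 1
    max_abs = max(abs(v) for row in samples for v in row)
    return max_abs / qmax if max_abs > 0 else 1.0


def quantize(row, scale, n_bits=8):
    """Inference step: round to integers using the precomputed (static) scale."""
    qmax = 2 ** (n_bits - 1) - 1
    return [max(-qmax - 1, min(qmax, round(v / scale))) for v in row]


def dequantize(q_row, scale):
    """Map integer codes back to approximate real values."""
    return [q * scale for q in q_row]


def mean_abs_error(row, scale):
    """Average reconstruction error after a quantize/dequantize round trip."""
    recon = dequantize(quantize(row, scale), scale)
    return sum(abs(a - b) for a, b in zip(row, recon)) / len(row)


# Toy calibration data (assumed values): visual activations span a much
# wider range than text activations.
visual_calib = [[12.0, -30.0, 25.4], [8.0, 127.0, -64.0]]
text_calib = [[0.5, -1.2, 0.8], [1.27, -0.3, 0.1]]

# One scale shared by both modalities versus one scale per modality.
shared_scale = calibrate_scale(visual_calib + text_calib)
text_scale = calibrate_scale(text_calib)

text_token = [0.5, -1.2, 0.8]
err_shared = mean_abs_error(text_token, shared_scale)
err_specific = mean_abs_error(text_token, text_scale)

print(f"text error with shared scale:   {err_shared:.4f}")
print(f"text error with text-only scale: {err_specific:.4f}")
```

In this toy setting the shared scale is dominated by the visual range, so the text token loses most of its resolution, while the text-only scale reconstructs it far more accurately. Because both scales are fixed after calibration, no per-token statistics need to be computed at inference time, which is the general appeal of static over dynamic quantization.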

Primary Area: foundation or frontier models, including LLMs

Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.

Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.

Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.

No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.

Submission Number: 4416
