LUQ: Layerwise Ultra-Low Bit Quantization for Multimodal Large Language Models

TMLR Paper6016 Authors

27 Sept 2025 (modified: 12 Jun 2026) | Decision pending for TMLR | CC BY 4.0
Abstract: Large Language Models (LLMs) with multimodal capabilities have revolutionized vision-language tasks, but their deployment often requires substantial memory and computational resources. Although post-training quantization (PTQ) has successfully compressed language models to as low as 1-bit precision, its effectiveness for multimodal LLMs (MLLMs) remains unexplored. In this paper, we present the first method for ultra-low-bit (<4-bit) quantization of MLLMs. Our analysis reveals that multimodal tokens, and the intermediate layer activations they produce, exhibit significantly higher entropy than text tokens, indicating greater functional complexity that makes MLLMs less tolerant to ultra-low-bit quantization. However, this entropy varies significantly across layers: some layers produce lower-entropy activation distributions, and we show empirically that these layers better tolerate ultra-low-bit quantization. Existing PTQ methods optimize weight quantization within each layer but apply the same target precision uniformly, ignoring this variation in complexity across layers. Building on this insight, we propose LUQ: Layerwise Ultra-Low Bit Quantization, which characterizes each transformer layer's functional complexity via its output activation entropy and selectively applies ultra-low-bit quantization to layers encoding simpler, more compressible functions. We also show that multimodal calibration (image and text tokens) boosts VQA performance in the ultra-low-bit regime. Evaluated on LLaVA-1.5 and Qwen-2.5-VL across 9 VQA benchmarks, LUQ models use 40% and 31% less memory than their 4-bit counterparts, respectively, while exhibiting less than 10% degradation on MME.
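The layer-selection idea in the abstract can be illustrated with a small sketch. It is not the authors' implementation: this page gives neither the exact entropy estimator nor the selection rule, so the code below assumes that each layer's output activations are clustered with K-means, that the Shannon entropy of the cluster-assignment histogram serves as the complexity score, and that the lowest-scoring layers are marked for ultra-low-bit quantization. The function names, the default `k=8`, and the `num_ultra_low` parameter are hypothetical.

```python
import numpy as np


def kmeans_labels(x, k, iters=20, seed=0):
    """Plain Lloyd's k-means; returns the cluster index of every row of x."""
    rng = np.random.default_rng(seed)
    # Initialise the centers from k distinct rows (fancy indexing copies).
    centers = x[rng.choice(len(x), size=k, replace=False)]
    for _ in range(iters):
        # Squared Euclidean distance from every row to every center: (n, k).
        dist = ((x[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
        labels = dist.argmin(axis=1)
        for j in range(k):
            members = x[labels == j]
            if len(members) > 0:  # leave empty clusters where they are
                centers[j] = members.mean(axis=0)
    return labels


def activation_entropy(acts, k=8):
    """Shannon entropy (in bits) of the k-means assignment histogram."""
    labels = kmeans_labels(np.asarray(acts, dtype=np.float64), k)
    counts = np.bincount(labels, minlength=k)
    p = counts[counts > 0] / counts.sum()
    return float(-(p * np.log2(p)).sum())


def select_low_entropy_layers(layer_acts, num_ultra_low, k=8):
    """Return indices of the num_ultra_low layers with the lowest entropy.

    Assumption: these are the layers a layerwise scheme would quantize
    most aggressively; the rest would stay at a higher precision.
    """
    entropies = [activation_entropy(a, k) for a in layer_acts]
    order = np.argsort(entropies)
    return sorted(int(i) for i in order[:num_ultra_low]), entropies


# Toy activations: (tokens, hidden_dim) arrays, one per layer.
rng = np.random.default_rng(1)
spread_layer = rng.uniform(size=(400, 2))                       # spread out
simple_layer = np.vstack([np.zeros((190, 2)), np.full((10, 2), 10.0)])
chosen, scores = select_low_entropy_layers(
    [spread_layer, simple_layer], num_ultra_low=1
)
```

In this toy setup the layer whose activations collapse onto two points scores far lower than the layer with spread-out activations, so it is the one selected. Real activations would come from a calibration set, which the abstract suggests should contain both image and text tokens.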
Submission Length: Regular submission (no more than 12 pages of main content)
Changes Since Last Submission: All changes made in this revision and the previous one are highlighted in blue.
Updates in this revision:
- Added TextVQA and GQA results for OmniQuant and SlimLLM baselines
Updates in previous revision:
- Softened causal framing of entropy to empirical proxy (Sec. 1)
- Added OmniQuant and SlimLLM baselines (Table 1)
- Added std. deviations (Tables 1, 2)
- "First study" → "first method"; removed imprecise "variance" from abstract
- Added K-means/Shannon entropy motivation and Hadamard invariance discussion (Sec. 3.2); QuaRot/ResQ references (Sec. 2.2)
- Added hyperparameter K sensitivity analysis (Appendix B.2)
- Added deployment/latency evaluation with llama.cpp (Appendix D)
Assigned Action Editor: ~Jeffrey_Pennington1
Submission Number: 6016