AutoMixQ: Automatic Mixed-precision Quantization for Deploying Bit-Efficient LLMs

ICLR 2026 Conference Submission 18182 Authors

19 Sept 2025 (modified: 08 Oct 2025), ICLR 2026 Conference Submission, CC BY 4.0
Keywords: LLM, Quantization, Mixed-precision
TL;DR: This paper presents an automatic mixed-precision quantization method for low-bit-width large language models.
Abstract: Quantization has become a critical technique for efficiently deploying large language models (LLMs), as their massive size makes full-precision inference impractical on most hardware. Among various quantization strategies, 4-bit post-training quantization (PTQ) strikes a favorable balance between compression and performance for hardware-accelerated deployment. Reducing precision below 4 bits would increase efficiency further, but often leads to severe performance degradation. This dilemma stems from two overlooked issues: 1) most PTQ methods primarily focus on reducing quantization error caused by salient channels with larger magnitude, while neglecting unremarkable channels with lower magnitude but high semantic relevance; 2) most PTQ methods apply quantization based on layer-wise reconstruction loss, failing to account for the cumulative and interdependent effects across layers. In this work, we present AutoMixQ, an automatic mixed-precision quantization framework that addresses these two key challenges of sub-4-bit quantization. Instead of focusing solely on salient channels, AutoMixQ considers both salient and unremarkable channels by introducing mixed-bit strategies that capture diverse quantization sensitivities across channels. AutoMixQ first constrains the search space using prior empirical observations for stability and efficiency. It then conducts an automatic search guided by distillation and global losses that model intra- and inter-layer dependencies, achieving holistically optimized performance. Experiments on several LLMs demonstrate that AutoMixQ achieves higher accuracy under low-bit settings, outperforming existing methods by producing more balanced and effective bit allocations.
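To make the per-channel mixed-bit idea concrete, below is a minimal illustrative sketch of a budget-constrained bit-allocation search for a single linear layer. It is not the AutoMixQ algorithm from the paper: the function names, the candidate bit set {2, 3, 4}, the greedy upgrade strategy, and the simple output-error proxy for the distillation/global losses are all assumptions made for illustration only.

```python
# Illustrative sketch only: a toy per-channel mixed-precision search for one
# linear layer. The candidate bit set {2, 3, 4}, the greedy budget-constrained
# search, and the output-error loss proxy are assumptions for illustration;
# they are not the published AutoMixQ method.
import torch

def quantize_channel(w, bits):
    """Uniform symmetric quantization of a single weight channel (1-D tensor)."""
    qmax = 2 ** (bits - 1) - 1
    max_abs = w.abs().max()
    scale = max_abs / qmax if max_abs > 0 else 1.0
    return torch.round(w / scale).clamp(-qmax - 1, qmax) * scale

def layer_loss(W_q, W, x):
    """Proxy loss: output error of the quantized layer against the
    full-precision layer on calibration activations x."""
    return torch.norm(x @ W_q.T - x @ W.T)

def greedy_bit_search(W, x, avg_bit_budget=3.0, candidates=(2, 3, 4)):
    """Assign a bit-width per output channel under an average-bit budget.
    Start every channel at the lowest precision, then repeatedly upgrade the
    channel whose upgrade most reduces the loss, until the budget is used."""
    n = W.shape[0]
    bits = [min(candidates)] * n

    def build():
        return torch.stack([quantize_channel(W[i], bits[i]) for i in range(n)])

    while sum(bits) / n < avg_bit_budget:
        base = layer_loss(build(), W, x)
        best_gain, best_i = 0.0, None
        for i in range(n):
            if bits[i] >= max(candidates):
                continue
            bits[i] += 1
            gain = base - layer_loss(build(), W, x)
            bits[i] -= 1
            if gain > best_gain:
                best_gain, best_i = gain, i
        if best_i is None:
            break
        bits[best_i] += 1
    return bits

# Toy usage with random tensors standing in for weights and calibration data.
torch.manual_seed(0)
W = torch.randn(8, 16)          # 8 output channels, 16 input features
x = torch.randn(32, 16)         # 32 calibration samples
print(greedy_bit_search(W, x))  # e.g. a mix of 2/3/4-bit channels
```

A real search would of course operate over all layers jointly (capturing the inter-layer dependencies the abstract describes) rather than one layer in isolation; this sketch only shows the shape of a per-channel, budget-aware allocation.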
Primary Area: foundation or frontier models, including LLMs
Submission Number: 18182