QAQ: Query-adaptive Mixed-precision Quantization for Large Language Models

NeurIPS 2025 Workshop MLForSys Submission 68 Authors

Published: 30 Oct 2025, Last Modified: 14 Nov 2025 · MLForSys 2025 · CC BY 4.0
Keywords: Efficient LLM; LLM Quantization
Abstract: Large language models (LLMs) achieve strong performance, yet inference is still bounded by trade-offs between efficiency and accuracy. While quantization cuts memory and latency, it fails to flexibly accommodate heterogeneous inputs. We introduce Query-Aware Quantization (QAQ), a dynamic-precision scheme that decomposes model weights into bit-planes, employs a trainable router for query-conditioned precision selection, and supports on-demand CPU$\leftrightarrow$GPU loading. On Qwen3 and LLaMA-3.1, QAQ matches the accuracy of 8-bit baselines while reducing GPU memory footprint, with an associated latency overhead. These results suggest that QAQ offers a practical operating point on the efficiency–accuracy frontier for LLM inference.
Submission Number: 68
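
To make the mechanism described in the abstract concrete, below is a minimal NumPy sketch of bit-plane decomposition with query-conditioned precision selection. It assumes uint8 affine quantization of the weights; the function names (`decompose_bitplanes`, `reconstruct`, `route_bits`) are hypothetical, and a fixed-threshold heuristic stands in for the paper's trainable router. This is an illustration of the general idea, not the authors' implementation.

```python
# Sketch: bit-plane decomposition + query-conditioned precision selection.
# Assumes weights stored as uint8 after affine quantization; all names here
# are illustrative, not from the QAQ paper.
import numpy as np

NUM_PLANES = 8  # 8-bit base quantization -> 8 binary planes

def decompose_bitplanes(w_q: np.ndarray) -> np.ndarray:
    """Split a uint8 weight tensor into 8 binary bit-planes.

    Plane k holds bit k of every weight. Planes could live on CPU and be
    streamed to GPU on demand, most-significant planes first.
    """
    return np.stack([(w_q >> k) & 1 for k in range(NUM_PLANES)])

def reconstruct(planes: np.ndarray, bits: int,
                scale: float, zero: float) -> np.ndarray:
    """Rebuild weights from only the top `bits` most-significant planes,
    then dequantize with the affine parameters (scale, zero)."""
    w_q = np.zeros(planes.shape[1:], dtype=np.uint8)
    for k in range(NUM_PLANES - bits, NUM_PLANES):
        w_q |= planes[k].astype(np.uint8) << k
    return scale * (w_q.astype(np.float32) - zero)

def route_bits(query_embedding: np.ndarray) -> int:
    """Stand-in for the trainable router: map a query feature to a
    precision in {4, 6, 8}. A fixed threshold replaces learning here."""
    difficulty = float(np.linalg.norm(query_embedding))
    return 4 if difficulty < 1.0 else (6 if difficulty < 2.0 else 8)

# Toy usage: quantize random weights, route a precision, reconstruct.
rng = np.random.default_rng(0)
w = rng.normal(size=(16, 16)).astype(np.float32)
scale = (w.max() - w.min()) / 255.0
zero = -w.min() / scale
w_q = np.clip(np.round(w / scale + zero), 0, 255).astype(np.uint8)

planes = decompose_bitplanes(w_q)
bits = route_bits(rng.normal(size=8))
w_hat = reconstruct(planes, bits, scale, zero)
print(f"bits={bits}, max reconstruction error={np.abs(w - w_hat).max():.4f}")
```

Truncating to the top `bits` planes is what makes the memory/accuracy trade-off query-dependent: easy queries load fewer planes (less GPU memory, coarser weights), hard queries load more.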