Keywords: Efficient LLM; LLM Quantization
Abstract: Large language models (LLMs) achieve strong performance, yet their inference remains constrained by the trade-off between efficiency and accuracy.
Quantization cuts memory use and latency, but conventional schemes fix a single precision offline and therefore cannot adapt to heterogeneous inputs.
We introduce Query-Aware Quantization (QAQ), a dynamic-precision scheme that decomposes model weights into bit-planes, employs a trainable router for query-conditioned precision selection, and supports on-demand CPU$\leftrightarrow$GPU loading.
On Qwen3 and LLaMA-3.1, QAQ matches the accuracy of 8-bit baselines while reducing the GPU memory footprint, at the cost of additional latency.
These results suggest that QAQ offers a practical operating point on the efficiency–accuracy frontier for LLM inference.
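To make the bit-plane idea named in the abstract concrete, the following is a minimal sketch, not the authors' implementation: an 8-bit quantized weight matrix is split into binary bit-planes and reconstructed from only its most significant planes. The uint8 representation, function names, and the top-`num_planes` truncation are illustrative assumptions.

```python
# Hypothetical sketch of bit-plane decomposition for quantized weights
# (not the authors' code); shapes and names are assumptions.
import numpy as np


def to_bitplanes(q_weights: np.ndarray) -> np.ndarray:
    """Split uint8-quantized weights into 8 binary planes (plane 0 = MSB)."""
    assert q_weights.dtype == np.uint8
    return np.stack([(q_weights >> (7 - b)) & 1 for b in range(8)]).astype(np.uint8)


def from_bitplanes(planes: np.ndarray, num_planes: int) -> np.ndarray:
    """Reconstruct weights from the top `num_planes` significant planes.

    Dropped low-order planes contribute zero; in a QAQ-style system they
    could stay in CPU memory and be loaded on demand when a query is
    routed to a higher precision.
    """
    acc = np.zeros(planes.shape[1:], dtype=np.int64)
    for b in range(num_planes):
        acc += planes[b].astype(np.int64) << (7 - b)
    return acc.astype(np.uint8)


rng = np.random.default_rng(0)
w8 = rng.integers(0, 256, size=(4, 4), dtype=np.uint8)  # stand-in quantized weights
planes = to_bitplanes(w8)
w4 = from_bitplanes(planes, num_planes=4)               # 4-bit-equivalent view
assert np.abs(w8.astype(int) - w4.astype(int)).max() < 2**4  # truncation error bound
```

In this reading, the trainable router described in the abstract would choose `num_planes` per query, and the on-demand CPU$\leftrightarrow$GPU loading would cover the lower-order planes that a query's chosen precision requires.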
Submission Number: 68