AURA: Augmented Representation for Unified Accuracy-aware Quantization

16 Sept 2025 (modified: 28 Nov 2025) · ICLR 2026 Conference Withdrawn Submission · CC BY 4.0
Keywords: Microscaling Formats, Post-training Quantization, Large Language Models
TL;DR: AURA enhances quantization accuracy by fusing error compensation into a single, hardware-friendly GEMM operation via an augmented matrix.
Abstract: Low-bit quantization is essential for the efficient deployment of Large Language Models (LLMs), but it often degrades model accuracy due to activation outliers concentrated in a few critical channels. Current solutions face a trade-off: transformation-based methods incur significant overhead from offline parameter learning and online execution, while mixed-precision schemes, despite their efficiency, lack forward compatibility because they rely on customized General Matrix Multiplication (GEMM) kernels. To address this dilemma, we introduce AURA (Augmented Representation for Unified Accuracy-aware Quantization). AURA employs a theoretically grounded, accuracy-aware metric to identify critical channels and compensates for their quantization errors by constructing a unified, low-bit augmented matrix. This design decouples the error-compensation logic from the underlying GEMM kernel, enabling high performance on current hardware while ensuring robust adaptability to future architectures. Our experiments on Llama and Qwen models demonstrate that AURA achieves state-of-the-art results for 4-bit quantization across a wide range of benchmarks. In terms of system performance, our framework significantly outperforms TensorRT-FP8 baselines, achieving a nearly 3-fold reduction in prefill latency and reducing peak decoding memory to one-third of the FP16 baseline.
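The abstract does not spell out how the augmented matrix is built, so the following is only a minimal NumPy sketch of the general "error compensation folded into one GEMM" idea it describes. Everything here is an assumption for illustration: the fake INT4 quantizer, the residual-based compensation, and all function names are hypothetical and are not claimed to be the authors' exact construction.

```python
import numpy as np

def quantize_int4_per_channel(x):
    """Symmetric per-channel (last-axis) fake INT4 quantization; returns dequantized values."""
    scale = np.max(np.abs(x), axis=0, keepdims=True) / 7.0 + 1e-8
    q = np.clip(np.round(x / scale), -8, 7)
    return q * scale

def augmented_gemm(x, w, critical_idx):
    """Hypothetical sketch: compensate quantization error of critical channels
    by appending extra low-bit columns/rows, so one standard GEMM computes both
    the base product and the compensation term.
    x: (tokens, in_features) activations; w: (in_features, out_features) weights;
    critical_idx: outlier-heavy channel indices (selected offline by some metric)."""
    xq = quantize_int4_per_channel(x)                        # low-bit activations
    residual = x[:, critical_idx] - xq[:, critical_idx]      # quantization error on critical channels
    rq = quantize_int4_per_channel(residual)                 # residual has small range, so low-bit suffices

    x_aug = np.concatenate([xq, rq], axis=1)                 # (tokens, in + k)
    w_aug = np.concatenate([w, w[critical_idx, :]], axis=0)  # (in + k, out): reuse critical-channel weights
    return x_aug @ w_aug                                     # ≈ x @ w via a single plain GEMM

# Toy check: the augmented product should track the full-precision product more closely.
rng = np.random.default_rng(0)
x = rng.normal(size=(8, 64)); x[:, :4] *= 20.0               # a few outlier channels
w = rng.normal(size=(64, 16))
critical = np.argsort(-np.abs(x).max(axis=0))[:4]
err_plain = np.abs(quantize_int4_per_channel(x) @ w - x @ w).mean()
err_aug   = np.abs(augmented_gemm(x, w, critical) - x @ w).mean()
print(err_plain, err_aug)
```

Because the compensation lives entirely in the augmented operands rather than in a modified kernel, any stock low-bit GEMM implementation could, in principle, execute it unchanged, which is the forward-compatibility point the abstract emphasizes.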
Supplementary Material: zip
Primary Area: other topics in machine learning (i.e., none of the above)
Submission Number: 7074