Beyond: Better-than-Full-Precision KV Caches via Learnable Non-Uniform Quantization as Implicit Regularization

ACL ARR 2026 May Submission 17171 Authors

26 May 2026 (modified: 16 Jun 2026), ACL ARR 2026 May Submission, CC BY 4.0

Keywords: KV cache quantization, non-uniform quantization, LLM inference, long-context decoding, efficient NLP, compression

Abstract: Key-value (KV) caching enables efficient autoregressive decoding, but KV-cache memory dominates long-context serving. Existing KV-cache quantizers are typically calibrated with local reconstruction or distortion losses, even though generation quality depends on how quantization perturbations propagate through subsequent model computation. We propose Beyond, a learnable non-uniform KV quantizer that trains representation levels and decision thresholds directly on the language-model loss while keeping the deployment-time hard quantizer in the forward pass. The backbone remains frozen; optimization uses standard straight-through estimator (STE) input gradients and hard-forward boundary pseudo-gradients for the thresholds. On Ministral-3-14B-Instruct-2512-BF16, Llama-3.1-8B-Instruct, and Qwen3-30B-A3B-Instruct-2507, Beyond at 4 bits with group size 32 achieves lower perplexity on the Pile evaluation set than the full-precision baseline and matches or improves upon full precision across seven downstream benchmarks. On a B300 GPU, our packed-cache decode path serves the learned cache directly and achieves up to 1.43$\times$ single-token decode speedup over FlashAttention-4.
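For readers unfamiliar with the terminology, the following is a minimal forward-only sketch of what a hard non-uniform quantizer with representation levels and decision thresholds can look like at 4 bits and group size 32. It is not the submission's implementation: the per-group max-abs scaling, the square-law level initialization, the midpoint thresholds, and the function names `quantize_groups` and `dequantize_groups` are all assumptions made for illustration. In particular, the abstract states that levels and thresholds are learned on the language-model loss, which this sketch does not attempt.

```python
import numpy as np

GROUP_SIZE = 32   # group size named in the abstract
NUM_LEVELS = 16   # 4 bits -> 16 representation levels

# Assumed (not learned) initialization: levels packed more densely near zero.
u = np.linspace(-1.0, 1.0, NUM_LEVELS)
levels = np.sign(u) * u**2                      # sorted, values in [-1, 1]
# Assumed decision thresholds: midpoints between neighboring levels.
thresholds = (levels[:-1] + levels[1:]) / 2.0   # NUM_LEVELS - 1 boundaries


def quantize_groups(x, thresholds, group_size=GROUP_SIZE):
    """Hard quantization: map each value to the index of its threshold bin."""
    groups = x.reshape(-1, group_size)
    # Assumed normalization: divide each group by its max absolute value.
    scales = np.abs(groups).max(axis=1, keepdims=True)
    scales = np.where(scales == 0, 1.0, scales)
    # Code = number of thresholds that are <= the normalized value.
    codes = np.searchsorted(thresholds, groups / scales, side="right")
    return codes.astype(np.uint8), scales


def dequantize_groups(codes, scales, levels, shape):
    """Look up each code's representation level and undo the group scaling."""
    return (levels[codes] * scales).reshape(shape)


rng = np.random.default_rng(0)
x = rng.standard_normal((4, 128))               # stand-in for a KV tensor
codes, scales = quantize_groups(x, thresholds)
x_hat = dequantize_groups(codes, scales, levels, x.shape)
```

In a trainable version, `levels` and `thresholds` would be parameters and the non-differentiable bin lookup would need a surrogate gradient such as the straight-through estimator mentioned in the abstract; that part is omitted here.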

Paper Type: Long

Research Area: Efficient Methods for NLP

Research Area Keywords: quantization, LLM efficiency, NLP in resource-constrained settings

Contribution Types: Model analysis & interpretability, NLP engineering experiment, Approaches to low-compute settings (efficiency), Theory

Languages Studied: English

EMNLP 2026 AI Reviewing Experiment: no

Submission Number: 17171
