Kitty: Accurate and Efficient 2-bit KV Cache Quantization with Dynamic Channel-wise Precision Boost

Published: 19 Mar 2026, Last Modified: 20 May 2026, Venue: MLSys 2026, License: CC BY 4.0
Keywords: KV Cache, Quantization, GPU, Algorithm-System Co-design, Large Language Model
TL;DR: We propose Kitty, an algorithm-system co-design for 2-bit KV cache quantization that cuts memory by nearly 8x and boosts LLM throughput with negligible accuracy loss.
Abstract: The KV cache is a dominant memory bottleneck for LLM inference. While 4-bit KV quantization preserves accuracy, 2-bit quantization often degrades it, especially on long-context reasoning. We close this gap with Kitty, an algorithm-system co-design for mixed-precision KV caching. On the algorithm side, extensive experiments show that Dynamic Channel-wise Precision Boost, which ranks Key-cache channels by sensitivity and keeps only a small fraction of them at higher precision, maintains a near-zero accuracy drop while approaching 2-bit memory usage. On the system side, the primary challenge is managing these dynamic 4-bit channel boosts without compromising memory efficiency or the execution speed of attention layers. Kitty addresses this through a hardware-aware memory layout and optimized system designs, so that on-the-fly KV quantization incurs negligible runtime overhead while preserving the memory savings and real-time inference throughput. Specifically, Kitty decomposes each mixed-precision Key page into two tensors with unified 2-bit precision. Based on this decomposition, Kitty provides a page-centric KV layout, Triton-compatible page dequantization kernels, and a lightweight runtime pipeline that reduces and amortizes the runtime overhead. Across seven tasks and two model families (Qwen3, LLaMA3), Kitty cuts KV memory by nearly $8\times$ with negligible accuracy loss, enabling up to $8\times$ larger batches and $2.1\times$–$4.1\times$ higher throughput under the same memory budget. We release the full implementation of Kitty at https://github.com/Summer-Summer/Kitty.
Topics: Algorithms: Efficient algorithms for serving LLMs and generative models, Algorithms: Model-compression-aware training and inference, Model Serving: Compression, quantization, pruning, distillation at system scale, Model Serving: System optimizations for model serving
Submission Number: 34
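Illustrative sketch (not from the paper or its repository): the abstract describes keeping a small fraction of Key-cache channels at 4-bit precision and storing each mixed-precision Key page as two tensors of uniform 2-bit precision. One way this could work is sketched below in plain Python. Everything here is an assumption made for illustration: the function names, the use of per-channel value range as the sensitivity score, the per-channel asymmetric quantizer, and the split of a 4-bit code into a high and a low 2-bit plane (`code = 4 * high + low`). Kitty's actual ranking criterion, grouping, and memory layout may differ.

```python
import random

def quantize(values, bits):
    """Asymmetric uniform quantization of a list of floats to `bits` bits."""
    levels = (1 << bits) - 1
    lo, hi = min(values), max(values)
    scale = (hi - lo) / levels if hi > lo else 1.0
    codes = [min(levels, max(0, round((v - lo) / scale))) for v in values]
    return codes, scale, lo

def quantize_key_page(page, boost_fraction):
    """page: list of channels, each a list of per-token floats.

    Returns a dense 2-bit plane covering every channel, a sparse 2-bit
    plane holding the extra low bits of boosted channels, and per-channel
    (scale, zero) metadata.
    """
    n_channels = len(page)
    n_boost = max(1, round(boost_fraction * n_channels))
    # Assumed sensitivity proxy: channels with the widest value range.
    ranges = [max(ch) - min(ch) for ch in page]
    ranked = sorted(range(n_channels), key=lambda c: -ranges[c])
    boosted = set(ranked[:n_boost])

    base_plane, extra_plane, meta = [], {}, []
    for c, ch in enumerate(page):
        if c in boosted:
            codes, scale, zero = quantize(ch, 4)
            base_plane.append([q >> 2 for q in codes])  # high 2 bits
            extra_plane[c] = [q & 3 for q in codes]     # low 2 bits
        else:
            codes, scale, zero = quantize(ch, 2)
            base_plane.append(codes)
        meta.append((scale, zero))
    return base_plane, extra_plane, meta

def dequantize_key_page(base_plane, extra_plane, meta):
    """Rebuild float channels; boosted channels merge both 2-bit planes."""
    out = []
    for c, codes in enumerate(base_plane):
        if c in extra_plane:
            codes = [(h << 2) | l for h, l in zip(codes, extra_plane[c])]
        scale, zero = meta[c]
        out.append([q * scale + zero for q in codes])
    return out

# Toy page: 16 channels x 32 tokens, with two artificially wide channels.
random.seed(0)
page = [[random.gauss(0, 1) for _ in range(32)] for _ in range(16)]
page[3] = [10 * v for v in page[3]]
page[7] = [10 * v for v in page[7]]

base, extra, meta = quantize_key_page(page, boost_fraction=0.125)
recon = dequantize_key_page(base, extra, meta)

# Average code bits per element, ignoring scale/zero-point metadata.
avg_bits = 2 * (len(base) + len(extra)) / len(base)
print("boosted channels:", sorted(extra), "avg bits:", avg_bits)
```

Because every stored code fits in 2 bits, both planes can use the same packing and unpacking routine, which is the property the abstract attributes to the two-tensor decomposition. In this toy setting, boosting a fraction f of the channels costs 2 + 2f bits per Key element before metadata; the 12.5% fraction used here is arbitrary and is not a figure from the paper.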