TL;DR: We propose a novel self-speculative decoding framework to accelerate long-context inference using KV cache and weight quantization.
Abstract: Large Language Models (LLMs) are increasingly being deployed on edge devices for long-context settings, creating a growing need for fast and efficient long-context inference. In these scenarios, the Key-Value (KV) cache is the primary bottleneck in terms of both GPU memory and latency, as the full KV cache must be loaded for each decoding step. While speculative decoding is a widely used technique to accelerate autoregressive decoding, existing methods often struggle to achieve significant speedups because their KV cache optimization strategies are inefficient and lead to low acceptance rates. To address these challenges, we propose a novel self-speculative decoding framework, QuantSpec, where the draft model shares the architecture of the target model but employs a hierarchical 4-bit quantized KV cache and 4-bit quantized weights for acceleration. QuantSpec maintains high acceptance rates ($>$90\%) and consistently provides end-to-end speedups of up to $\sim2.5\times$, outperforming other self-speculative decoding methods that use a sparse KV cache for long-context LLM inference. QuantSpec also reduces memory requirements by $\sim 1.3\times$ compared to these alternatives.
Lay Summary: Running large language models (LLMs) on devices like phones or laptops is becoming more common, especially for tasks that involve long conversations or documents. However, this is slow and memory-intensive, mainly because the model needs to repeatedly access a large memory cache at each step. We introduce QuantSpec, a new method that speeds up this process by using a smaller, compressed version of the model’s memory, without sacrificing quality. By using a lightweight version of the model to make fast guesses and then checking them with the full model, QuantSpec achieves up to 2.5× faster performance and reduces memory use by 1.3×, while still getting accurate results most of the time.
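To make the draft-then-verify idea concrete, below is a minimal, self-contained sketch of one speculative decoding step in the style described above: a cheap "draft" pass proposes a few tokens and the full-precision "target" pass verifies them. The helper callables `draft_step` and `target_logits`, the toy demo, and the greedy acceptance rule are illustrative assumptions, not QuantSpec's actual implementation or API; in QuantSpec the draft is the same model run with a hierarchical 4-bit KV cache and 4-bit weights.

```python
# Illustrative sketch of one self-speculative decoding step (assumptions labeled below).
# draft_step(tokens)    -> logits for the *next* token, using the cheap (quantized) pass
# target_logits(tokens) -> logits for *every* position, using the full-precision pass
# Both are hypothetical stand-ins, not QuantSpec's API.
import torch


def speculative_step(draft_step, target_logits, prefix, gamma=4):
    """Draft `gamma` candidate tokens cheaply, then verify them with the target model.

    Greedy acceptance is used here for simplicity; the rejection-sampling
    variant of speculative decoding preserves the target distribution exactly.
    """
    # 1) Autoregressively draft gamma candidate tokens with the cheap pass.
    draft_tokens = []
    tokens = prefix.clone()
    for _ in range(gamma):
        next_tok = draft_step(tokens).argmax(dim=-1, keepdim=True)
        draft_tokens.append(next_tok)
        tokens = torch.cat([tokens, next_tok])

    # 2) Score the whole drafted block with the target model in a single pass.
    logits = target_logits(tokens)  # shape: [len(tokens), vocab]

    # 3) Accept drafted tokens left-to-right while they match the target's
    #    greedy choice; at the first mismatch, keep the target's token and stop.
    accepted = prefix.clone()
    for i, tok in enumerate(draft_tokens):
        target_tok = logits[len(prefix) + i - 1].argmax(dim=-1, keepdim=True)
        if torch.equal(tok, target_tok):
            accepted = torch.cat([accepted, tok])
        else:
            accepted = torch.cat([accepted, target_tok])
            break
    return accepted


if __name__ == "__main__":
    # Toy demo: a random next-token scorer as the "target", and a lower-precision
    # copy of its weights as the "draft" (half precision here stands in for 4-bit).
    torch.manual_seed(0)
    vocab = 16
    W = torch.randn(vocab, vocab)
    target = lambda t: torch.nn.functional.one_hot(t, vocab).float() @ W
    draft = lambda t: (torch.nn.functional.one_hot(t, vocab).float() @ W.half().float())[-1]
    prefix = torch.tensor([1, 2, 3])
    print(speculative_step(draft, target, prefix))
```

Because the draft and target share weights and differ only in precision, most drafted tokens agree with the target's choice, which is why acceptance rates stay high and each verification pass commits several tokens at once.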
Primary Area: Deep Learning->Large Language Models
Keywords: speculative decoding, quantization, long-context inference
Submission Number: 9485