EntropyCache: Decoded Token Entropy Guided KV Caching for Diffusion Language Models

Published: 01 Jun 2026, Last Modified: 09 Jun 2026, Venue: AdaptFM Poster, License: CC BY 4.0
Keywords: Diffusion Language Models, KV Cache, Inference Acceleration, Adaptive Inference
TL;DR: A training-free KV caching method for diffusion LLMs that replaces expensive per-layer drift detection with a single entropy scalar, achieving up to 26.4× speedup while keeping decision overhead at 0.5% of inference time.
Abstract: Diffusion-based large language models (dLLMs) rely on bidirectional attention, which prevents lossless KV caching and requires a full forward pass at every denoising step. Existing approximate KV caching methods reduce this cost by selectively updating cached states, but their decision overhead scales with context length or model depth. We propose EntropyCache, a training-free KV caching method that uses the maximum entropy of newly decoded token distributions as a constant-cost signal for deciding *when* to recompute. Our design is grounded in two empirical observations: (1) decoded token entropy correlates with KV cache drift, providing a cheap proxy for cache staleness, and (2) feature volatility of decoded tokens persists for multiple steps after unmasking, motivating recomputation of the $k$ most recently decoded tokens. The skip-or-recompute decision requires only $O(V)$ computation per step, independent of context length and model scale. Experiments on LLaDA-8B-Instruct and Dream-7B-Instruct show that EntropyCache achieves 15.2×–26.4× speedup on standard benchmarks and 22.4×–24.1× on chain-of-thought benchmarks against vanilla baselines, with competitive accuracy and decision overhead accounting for only 0.5% of inference time.
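The skip-or-recompute decision described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the `threshold` parameter, and the rule "recompute when the maximum entropy exceeds the threshold" are assumptions made for illustration, as is reading $V$ as the vocabulary size. The sketch only shows why the signal is cheap: each decoded token contributes one pass over its output distribution, regardless of context length or model depth.

```python
import math
from typing import Sequence


def token_entropy(probs: Sequence[float]) -> float:
    """Shannon entropy (in nats) of one token distribution.

    Costs one pass over the distribution, i.e. O(V) if V is the
    vocabulary size (an assumed reading of the abstract's notation).
    """
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def should_recompute(
    decoded_probs: Sequence[Sequence[float]], threshold: float
) -> bool:
    """Hypothetical decision rule: recompute the cache when the maximum
    entropy among newly decoded tokens exceeds `threshold`.

    The direction of the comparison is an assumption, motivated by the
    stated correlation between decoded token entropy and cache drift.
    """
    if not decoded_probs:
        return False  # nothing was decoded this step, so keep the cache
    return max(token_entropy(p) for p in decoded_probs) > threshold


def recent_positions(decode_order: Sequence[int], k: int) -> list[int]:
    """Positions of the k most recently decoded tokens, oldest first.

    `decode_order` is a hypothetical log of positions in unmasking order.
    """
    return list(decode_order[-k:]) if k > 0 else []
```

In use, a confident distribution such as `[0.97, 0.01, 0.01, 0.01]` has low entropy and would leave the cache untouched under a threshold of `0.5`, while a uniform distribution over the same four tokens would trigger recomputation of the positions returned by `recent_positions`.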
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 64