Structuring The Future: Diffusion LLM Speculative Decoding via Calibrated Draft Graphs

Published: 26 May 2026, Last Modified: 08 Jun 2026 | ICML 2026 FoGen Workshop Poster | Everyone | CC BY 4.0
Keywords: Diffusion LLM, Speculative Decoding, Acceleration, Inference, Efficiency
TL;DR: Speculative decoding for diffusion LLM acceleration through optimized, calibrated draft graphs that capture the model's decoding dynamics.
Abstract: Diffusion LLMs (dLLMs) have recently emerged as a powerful alternative to autoregressive LLMs (AR-LLMs) with the potential to operate at significantly higher token-generation rates. To unlock this potential, we present Spiffy, a speculative decoding algorithm to accelerate dLLM inference while provably preserving the model's output distribution. This work addresses the unique challenges involved in applying ideas from speculative decoding of AR-LLMs to dLLMs. Spiffy performs auto-speculation to eliminate the overheads of an independent draft model, structuring draft states in the form of a novel directed draft graph to take advantage of the bidirectional, blockwise nature of dLLM generation. These draft graphs are calibrated offline to maximize acceptance rates by capturing the underlying decoding dynamics, and are dynamically pruned during inference for improved computational efficiency. We present a detailed formulation of Spiffy and demonstrate its acceleration of LLaDA, Dream, and SDAR models in combination with KV caching and threshold-based dynamic unmasking, leading to up to $8.6\times$ reduction in model inferences and $6.3\times$ acceleration in token rate.
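Illustrative sketch (not from the paper): the abstract says Spiffy is combined with threshold-based dynamic unmasking. The snippet below shows one common form of that baseline step, assuming a rule of "commit every masked position whose confidence reaches a threshold, and fall back to the single most confident position otherwise". The function name `unmask_step`, the `MASK` sentinel, and the default threshold of 0.9 are hypothetical choices for illustration. The exact rule used in the paper may differ, and this does not implement the draft graph or its verification.

```python
from typing import List, Optional

MASK = None  # hypothetical sentinel marking a still-masked position


def unmask_step(
    block: List[Optional[int]],
    proposals: List[int],
    confidences: List[float],
    threshold: float = 0.9,  # assumed value, not taken from the paper
) -> List[Optional[int]]:
    """Apply one thresholded unmasking step to a block of token slots.

    block:       current tokens, with MASK at positions not yet decoded
    proposals:   the model's top-1 token for every position
    confidences: the model's probability for each proposed token
    """
    masked = [i for i, tok in enumerate(block) if tok is MASK]
    if not masked:
        return list(block)  # nothing left to decode

    # Commit every masked position that clears the threshold.
    chosen = [i for i in masked if confidences[i] >= threshold]

    # Guarantee progress: if nothing clears it, commit the single best one.
    if not chosen:
        chosen = [max(masked, key=lambda i: confidences[i])]

    out = list(block)
    for i in chosen:
        out[i] = proposals[i]
    return out


# Two of the three masked slots clear the threshold in a single step.
step = unmask_step([MASK, 7, MASK, MASK], [3, 7, 5, 9], [0.95, 1.0, 0.40, 0.92])
print(step)  # [3, 7, None, 9]
```

Under this rule the number of tokens committed per model call varies with the model's confidence. Speculating over several future unmasking states, as the abstract describes, aims to reduce the number of model calls further.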
Submission Number: 76