Keywords: LLM, dLLM, Speculative Decoding
Abstract: Diffusion-based Large Language Models (dLLMs) have emerged as a promising alternative to autoregressive models, offering unique advantages through bidirectional attention and iterative denoising. However, their practical deployment is hindered by high inference latency, particularly in memory-bound scenarios, because traditional acceleration techniques such as Key-Value caching are incompatible with the bidirectional nature of dLLMs. We propose Self Speculative Decoding (SSD), a novel inference acceleration framework that uses the dLLM itself as both drafter and verifier, without requiring auxiliary models. SSD introduces a self-drafting mechanism in which the model generates initial predictions for all masked positions in a single forward pass; the self-speculative verification stage then progressively verifies multiple drafted tokens organized in a verification tree. Our method supports both greedy linear verification chains and mixed-order strategies that allow out-of-order (jumping) predictions, adapting to different accuracy-speed trade-offs. Because SSD is self-speculative, it eliminates the need for a separate draft model, making it particularly efficient for deployment.
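To make the draft-then-verify loop concrete, below is a minimal sketch of the greedy linear verification variant described in the abstract. The dLLM interface (`model` returning per-position logits, the mask id, and the `ssd_step` helper) is hypothetical and chosen for illustration; it is not the authors' implementation, and a real system would reuse the verification pass as the draft for the next round rather than running two separate passes.

```python
import torch

MASK_ID = 0  # hypothetical mask-token id for the dLLM vocabulary


@torch.no_grad()
def ssd_step(model, tokens, masked_positions):
    """One self-speculative round (illustrative sketch).

    Self-drafting: a single forward pass predicts every masked position.
    Verification: the same model re-scores the filled-in draft, and a greedy
    linear chain accepts drafted tokens left-to-right until the first mismatch.
    """
    # --- self-drafting: one forward pass over the partially masked sequence ---
    logits = model(tokens)                        # (seq_len, vocab_size)
    draft = logits[masked_positions].argmax(-1)   # greedy draft for each masked slot

    # Tentatively fill the draft in and re-score it with the same model.
    candidate = tokens.clone()
    candidate[masked_positions] = draft
    verify_logits = model(candidate)

    # --- greedy linear verification chain: accept a prefix of the draft ---
    accepted = []
    for pos, tok in zip(masked_positions, draft.tolist()):
        if verify_logits[pos].argmax(-1).item() == tok:
            accepted.append((pos, tok))
        else:
            break  # first disagreement ends the accepted prefix

    for pos, tok in accepted:
        tokens[pos] = tok
    return tokens, len(accepted)
```

Under these assumptions, several tokens can be committed per round whenever the verifier agrees with its own draft, while a single accepted token per round recovers ordinary iterative denoising; the mixed-order variant would differ only in how the accepted positions are chosen.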
Supplementary Material: zip
Primary Area: foundation or frontier models, including LLMs
Submission Number: 4670