Keywords: LLM, dLLM, Speculative Decoding
Abstract: Diffusion-based Large Language Models (dLLMs) have emerged as a promising alternative to autoregressive models, offering unique advantages through bidirectional attention and iterative denoising. However, their practical deployment is hindered by high inference latency, particularly in memory-bound scenarios, because traditional acceleration techniques such as Key-Value caching are incompatible with the bidirectional nature of dLLMs. We propose Self Speculative Decoding (SSD), a novel inference acceleration framework that uses the dLLM itself as both drafter and verifier, without requiring auxiliary models. SSD introduces a self-drafting mechanism in which the model generates initial predictions for all masked positions in a single forward pass; the self-speculative verification stage then progressively verifies multiple drafted tokens organized in a verification tree. Our method supports both greedy linear verification chains and mixed-order strategies that allow out-of-order (jumping) predictions, adapting to different accuracy-speed trade-offs. Because SSD is self-speculative, it eliminates the need for a separate draft model, making it particularly efficient for deployment.
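To make the draft-then-verify loop concrete, below is a minimal sketch of the greedy linear verification variant described in the abstract. The dLLM interface (`model` returning per-position logits, the mask id, and the `ssd_step` helper) is hypothetical and chosen for illustration; it is not the authors' implementation, and a real system would reuse the verification pass as the draft for the next round rather than running two separate passes.

```python
import torch

MASK_ID = 0  # hypothetical mask-token id for the dLLM vocabulary


@torch.no_grad()
def ssd_step(model, tokens, masked_positions):
    """One self-speculative round (illustrative sketch).

    Self-drafting: a single forward pass predicts every masked position.
    Verification: the same model re-scores the filled-in draft, and a greedy
    linear chain accepts drafted tokens left-to-right until the first mismatch.
    """
    # --- self-drafting: one forward pass over the partially masked sequence ---
    logits = model(tokens)                        # (seq_len, vocab_size)
    draft = logits[masked_positions].argmax(-1)   # greedy draft for each masked slot

    # Tentatively fill the draft in and re-score it with the same model.
    candidate = tokens.clone()
    candidate[masked_positions] = draft
    verify_logits = model(candidate)

    # --- greedy linear verification chain: accept a prefix of the draft ---
    accepted = []
    for pos, tok in zip(masked_positions, draft.tolist()):
        if verify_logits[pos].argmax(-1).item() == tok:
            accepted.append((pos, tok))
        else:
            break  # first disagreement ends the accepted prefix

    for pos, tok in accepted:
        tokens[pos] = tok
    return tokens, len(accepted)
```

Under these assumptions, several tokens can be committed per round whenever the verifier agrees with its own draft, while a single accepted token per round recovers ordinary iterative denoising; the mixed-order variant would differ only in how the accepted positions are chosen.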
Supplementary Material: zip
Primary Area: foundation or frontier models, including LLMs
Submission Number: 4670