Keywords: LLM, Speculative Decoding, AI Infra, Low Cost Training
Abstract: The autoregressive nature of large language models (LLMs) fundamentally limits inference speed, as each forward pass generates only a single token and is often bottlenecked by memory bandwidth. Speculative decoding has emerged as a promising solution, adopting a draft-then-verify strategy to accelerate token generation. While the EAGLE series achieves strong acceleration, its requirement of training a separate draft head for each target model introduces substantial adaptation costs. In this work, we propose **PARD (PARallel Draft)**, a novel speculative decoding method featuring target independence and parallel token prediction. Specifically, PARD enables a single draft model to be applied across an entire family of target models without requiring separate training for each variant, thereby minimizing adaptation costs. Meanwhile, PARD substantially accelerates inference by predicting multiple future tokens within a single forward pass of the draft phase. To further reduce the training adaptation cost of PARD, we propose a COnditional Drop-token (COD) mechanism based on the integrity of prefix key-value states, enabling autoregressive draft models to be adapted into parallel draft models at low cost. Our experiments show that the proposed COD method improves draft model training efficiency by 3x compared with traditional masked prediction training. On the vLLM inference framework, PARD achieves up to 3.67x speedup on LLaMA3.1-8B, reaching 264.88 tokens per second, which is 1.15x faster than EAGLE-3. Our code is available at https://github.com/AMD-AGI/PARD.
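The draft-then-verify loop described in the abstract can be sketched as follows. This is a minimal illustrative toy, not the paper's implementation: the draft and target "models" are stand-in deterministic functions over integer token IDs, and the draft length `K` is a hypothetical parameter. It shows the structure PARD exploits, where the draft step proposes several future tokens in one pass and the target accepts the longest verified prefix plus one corrected token.

```python
# Toy sketch of speculative decoding with a parallel draft step.
# The "models" are stand-in deterministic functions, not real LLMs.

K = 4  # hypothetical number of tokens drafted per step

def draft_parallel(prefix):
    """Stand-in parallel draft: propose K future tokens in one 'pass',
    mimicking PARD's multi-token prediction (here, a trivial rule)."""
    return [(prefix[-1] + i + 1) % 100 for i in range(K)]

def target_next(prefix):
    """Stand-in target model: the single 'correct' next token."""
    return (prefix[-1] + 1) % 100

def speculative_step(prefix):
    """Draft K tokens, then verify against the target model.
    Accept the longest matching prefix; on the first mismatch,
    substitute the target's token and stop."""
    proposal = draft_parallel(prefix)
    accepted = []
    for tok in proposal:
        expected = target_next(prefix + accepted)
        if tok == expected:
            accepted.append(tok)       # draft token verified
        else:
            accepted.append(expected)  # mismatch: take target's token
            break
    else:
        # all K drafts accepted; target contributes one bonus token
        accepted.append(target_next(prefix + accepted))
    return accepted

tokens = [0]
for _ in range(3):
    tokens += speculative_step(tokens)
print(tokens)  # three steps emit K+1 tokens each here
```

Because the toy draft rule always agrees with the toy target, every step accepts all K drafts plus one bonus token; with real models, the accepted length varies per step, and the speedup comes from verifying many drafted tokens in a single target forward pass.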
Primary Area: generative models
Submission Number: 15276