PARD: Accelerating LLM Inference with Low‑Cost PARallel Draft Model Adaptation

ICLR 2026 Conference Submission 15276 Authors

19 Sept 2025 (modified: 08 Oct 2025) · ICLR 2026 Conference Submission · CC BY 4.0
Keywords: LLM, Speculative Decoding, AI Infra, Low Cost Training
Abstract: The autoregressive nature of large language models (LLMs) fundamentally limits inference speed, as each forward pass generates only a single token and is often bottlenecked by memory bandwidth. Speculative decoding has emerged as a promising solution, adopting a draft-then-verify strategy to accelerate token generation. While the EAGLE series achieves strong acceleration, its requirement of training a separate draft head for each target model introduces substantial adaptation costs. In this work, we propose \textbf{PARD (PARallel Draft)}, a novel speculative decoding method featuring \textit{target-independence} and \textit{parallel token prediction}. Specifically, PARD enables a single draft model to be applied across an entire family of target models without requiring separate training for each variant, thereby minimizing adaptation costs. Meanwhile, PARD substantially accelerates inference by predicting multiple future tokens within a single forward pass of the draft phase. To further reduce the training adaptation cost of PARD, we propose a COnditional Drop-token (COD) mechanism based on the integrity of prefix key-value states, enabling autoregressive draft models to be adapted into parallel draft models at low cost. Our experiments show that the proposed COD method improves draft model training efficiency by \textbf{3$\times$} compared with traditional masked prediction training. On the \texttt{vLLM} inference framework, PARD achieves up to \textbf{3.67$\times$} speedup on LLaMA3.1-8B, reaching \textbf{264.88} tokens per second, which is \textbf{1.15$\times$} faster than EAGLE-3.
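The draft-then-verify loop with a parallel draft model, as described in the abstract, can be illustrated with a minimal sketch. The names below (`speculative_step`, `draft_forward`, `target_forward`) are hypothetical stand-ins for exposition only, not the paper's or vLLM's actual interfaces; the acceptance rule shown is the simple greedy-match variant, assumed here for clarity.

```python
from typing import Callable, List

def speculative_step(
    prefix: List[int],
    draft_forward: Callable[[List[int], int], List[int]],   # one call -> k draft tokens
    target_forward: Callable[[List[int]], List[int]],        # next-token choice at every position
    k: int = 4,
) -> List[int]:
    """One draft-then-verify iteration (hypothetical sketch).

    A parallel draft model proposes k future tokens in a single forward pass
    (conceptually, by appending k mask tokens to the prefix) instead of k
    autoregressive calls. The target model then scores prefix + drafts in one
    pass, and the longest run of agreeing draft tokens is accepted, followed by
    one token taken from the target itself.
    """
    drafts = draft_forward(prefix, k)                 # k proposals from a single draft pass
    target_preds = target_forward(prefix + drafts)    # target's prediction at each position
    accepted: List[int] = []
    for i, d in enumerate(drafts):
        if target_preds[len(prefix) - 1 + i] == d:    # draft token matches the target's choice
            accepted.append(d)
        else:
            break
    # Take the target's own token at the first disagreement (or after all drafts accept),
    # so every iteration emits at least one verified token.
    accepted.append(target_preds[len(prefix) - 1 + len(accepted)])
    return prefix + accepted
```

Under this structure, each iteration costs one draft forward pass plus one target forward pass, and can emit up to k+1 tokens when the draft agrees with the target.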
Primary Area: generative models
Submission Number: 15276