Keywords: Efficient LLM Inference, Speculative Decoding, Pipeline Parallel
Abstract: Large language models (LLMs) deliver impressive generation quality, but their auto-regressive decoding incurs very high inference cost.
Early-exit based speculative decoding (EESD) has emerged to reduce decoding latency.
However, in practice, many approaches fall short of the expected acceleration under the draft-then-verify paradigm, even with a well-aligned early-exit head and a carefully selected exit position.
Our analysis reveals that EESD only pays off when the vast majority of draft tokens are accepted by the LLM.
Otherwise, the drafting cost can outweigh the acceleration gain and result in a net slowdown.
To mitigate this, we propose \textbf{Pipeline-Parallel Speculative Decoding (PPSD)}, which fully pipelines the draft and verification work so that no effort is wasted on failed predictions.
It has two key innovations.
Pipeline-Parallel Early-Exit Execution: We design a fine-grained pipeline allocation and execution system, in which early-exit (draft) computations and remaining-layer (verification) computations overlap with minimal time lost to blocking.
Verify-while-draft Decoding: We interleave drafting and verification per token.
While the LLM is verifying the current token in its final layers, the early-exit path simultaneously drafts the next token.
This highly parallel scheme keeps all units busy and validates tokens on the fly, effectively pipelining the speculation and verification stages.
Each token is confirmed as soon as it enters the output, ensuring correctness without stalling.
These design choices are supported by both a theoretical analysis of pipelined throughput and extensive experiments.
Empirical results confirm that PPSD achieves state-of-the-art acceleration in self-speculative LLM inference.
On diverse benchmarks, PPSD achieves speedup ratios in the range of $2.01\times\sim3.81\times$, which is close to the optimal acceleration attainable at a fixed acceptance rate and exit position.
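The claim that sequential draft-then-verify decoding only pays off at high acceptance rates can be illustrated with the commonly used cost model for speculative decoding. The sketch below is not the paper's own analysis. It assumes each draft token is accepted independently with probability `alpha`, that `gamma` tokens are drafted per round, and that one draft step costs a fraction `c` of a full forward pass (for an early-exit head, roughly the exit layer divided by the total layer count, ignoring any reuse of early-layer activations during verification). The names `alpha`, `gamma`, `c`, and `expected_speedup` are illustrative.

```python
def expected_speedup(alpha: float, gamma: int, c: float) -> float:
    """Expected speedup of sequential draft-then-verify decoding over plain
    auto-regressive decoding, under a simplified cost model (assumption).

    alpha: per-token acceptance probability (assumed i.i.d. across tokens)
    gamma: number of tokens drafted before each verification pass
    c:     cost of one draft step relative to one full forward pass
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    if gamma < 1 or c < 0.0:
        raise ValueError("gamma must be >= 1 and c must be >= 0")

    # Expected tokens produced per round: accepted draft tokens plus the one
    # token that the verification pass always yields.
    if alpha == 1.0:
        tokens_per_round = gamma + 1
    else:
        tokens_per_round = (1.0 - alpha ** (gamma + 1)) / (1.0 - alpha)

    # Cost per round in units of one full forward pass: gamma draft steps
    # followed by one verification pass.
    cost_per_round = gamma * c + 1.0

    return tokens_per_round / cost_per_round


if __name__ == "__main__":
    for alpha in (0.5, 0.7, 0.9):
        print(alpha, round(expected_speedup(alpha, gamma=4, c=0.5), 3))
```

Under these assumptions, `expected_speedup(0.5, 4, 0.5)` falls below 1 (a net slowdown), while `expected_speedup(0.9, 4, 0.25)` exceeds 2. This is the regime dependence that motivates overlapping drafting with verification instead of running them back to back.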
Supplementary Material: zip
Primary Area: foundation or frontier models, including LLMs
Submission Number: 6521