Keywords: large language models, efficient decoding, speculative decoding
TL;DR: A drop-in mechanism to improve draft token utilization in Speculative Decoding, accelerating LLM inference losslessly
Abstract: Large language models (LLMs) deliver strong generative performance but suffer from high inference latency. Speculative Decoding (SD) accelerates inference by allowing a fast draft model to propose tokens, which are then verified in parallel by a larger target model. Although SD is lossless, preserving the target model's generation quality exactly, its key challenge lies in draft token efficiency: ensuring that as many drafted tokens as possible are converted into useful tokens in the final output. We present a holistic token-efficient SD strategy built on two complementary mechanisms. \textit{Ex-post utilization (Post-use)} employs a token cache to recycle and reuse useful drafts in subsequent forward passes. \textit{Ex-ante reduction (Pre-cut)} adaptively controls draft length, preventing overproduction when the marginal benefit falls below the cost. Together, these mechanisms both reuse what has been produced and eliminate what should not be produced. Experiments show a 2.52–3.23$\times$ overall speedup over auto-regressive decoding and over 20\% higher token utilization than vanilla SD methods.
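To make the draft-and-verify loop and the "token utilization" metric concrete, here is a minimal sketch of vanilla greedy speculative decoding. The two toy next-token functions and all names are illustrative assumptions, not the paper's implementation; the point is that the output matches target-only decoding exactly, while utilization (accepted drafts over proposed drafts) is the quantity Post-use and Pre-cut aim to raise.

```python
# Toy greedy speculative decoding sketch (assumed toy models, not the
# paper's method). Output is provably identical to target-only decoding.

def draft_next(ctx):
    # Hypothetical cheap draft model: a simple heuristic rule.
    return (ctx[-1] + 1) % 5

def target_next(ctx):
    # Hypothetical target model: the reference rule defining correctness.
    return (ctx[-1] + 1) % 5 if ctx[-1] != 3 else 0

def speculative_decode(ctx, k, steps):
    accepted = proposed = 0
    for _ in range(steps):
        # Draft phase: the cheap model proposes k tokens autoregressively.
        drafts, tmp = [], list(ctx)
        for _ in range(k):
            t = draft_next(tmp)
            drafts.append(t)
            tmp.append(t)
        proposed += k
        # Verify phase: accept the longest prefix the target agrees with
        # (in practice all positions are checked in one parallel pass).
        n, tmp = 0, list(ctx)
        for t in drafts:
            if target_next(tmp) != t:
                break
            tmp.append(t)
            n += 1
        accepted += n
        # Keep accepted drafts, then append the target's own next token,
        # so the sequence equals target-only greedy decoding.
        ctx = ctx + drafts[:n] + [target_next(ctx + drafts[:n])]
    return ctx, accepted / proposed  # utilization: accepted / proposed

out, util = speculative_decode([0], k=4, steps=3)
```

In this toy run the draft model is wrong once per cycle (at the wrap-around), so utilization is 3/4; vanilla SD discards the rejected drafts, which is exactly the waste that caching (Post-use) and adaptive draft-length control (Pre-cut) target.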
Primary Area: generative models
Submission Number: 2943