Keywords: speculative decoding, inference acceleration, large language models
Abstract: Speculative decoding (SD), in which a small draft model proposes *draft* tokens in advance and the target model then validates them in parallel, has emerged as a promising technique for LLM inference acceleration. Many recent efforts to improve SD eliminate the need for a draft model and instead generate draft tokens in a retrieval-based manner, which further alleviates the drafting overhead and significantly reduces the difficulty of deployment. However, retrieval-based SD relies on a matching paradigm to retrieve the most relevant reference as the draft tokens, and these methods often fail to find matched and accurate drafts. To address this, we propose *LogitSpec*, which effectively expands the retrieval range and finds the most relevant reference as drafts. *LogitSpec* is motivated by the observation that the logit of the last token can not only predict **the next token** but also speculate **the next-next token**. Specifically, *LogitSpec* generates draft tokens in two steps: (1) using the last logit to speculate the next-next token; (2) retrieving relevant references for both the next token and the next-next token. *LogitSpec* is training-free and plug-and-play, and can be easily integrated into existing LLM inference frameworks. Extensive experiments on a wide range of text generation benchmarks demonstrate that *LogitSpec* achieves up to 2.61$\times$ speedup and 3.28 mean accepted tokens per decoding step.
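The two-step drafting described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation; the function name `logitspec_draft`, the top-k heuristic, and the bigram-matching retrieval over the running context are assumptions made purely for illustration.

```python
import torch

def logitspec_draft(context_ids: list[int], last_logits: torch.Tensor,
                    top_k: int = 4, max_draft_len: int = 8) -> list[int]:
    """Illustrative sketch of logit-guided retrieval drafting.

    Step 1: the top-1 entry of the last logit gives the next token; the
            remaining top-k entries serve as heuristic guesses for the
            next-next token.
    Step 2: search the running context for a (next token, guess) bigram and
            reuse the tokens that followed it as additional draft tokens.
    """
    topk = torch.topk(last_logits, top_k).indices.tolist()
    next_token, next_next_guesses = topk[0], topk[1:]

    draft = [next_token]
    for guess in next_next_guesses:
        # Scan backwards for the most recent occurrence of the bigram.
        for pos in range(len(context_ids) - 2, -1, -1):
            if context_ids[pos] == next_token and context_ids[pos + 1] == guess:
                draft += context_ids[pos + 1 : pos + 1 + max_draft_len - 1]
                return draft[:max_draft_len]
    # No reference found: fall back to the single next token.
    return draft
```

The returned draft tokens would then be verified in parallel by the target model, as in standard speculative decoding.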
Supplementary Material: zip
Primary Area: generative models
Submission Number: 13250