Keywords: Large Language Models, Lossless Acceleration, Speculative Sampling
Abstract: Speculative sampling has emerged as a promising approach for accelerating large language model (LLM) inference by leveraging a lightweight draft model to propose multiple candidate tokens, which are then verified in parallel by a target model.
Recent methods enhance this process by structuring candidate sequences into a token tree for more efficient verification.
However, existing tree construction methods rely excessively on acceptance length as a proxy for speedup.
This indirect objective makes it difficult to find the tree structure that maximizes speedup.
In this paper, we first revisit prior approaches and find they suffer from two key limitations: analytical intractability and the assumption of node independence.
We then redefine the costs and benefits of each tree node, derive a function that characterizes the relationship between time reduction and draft length, and prove its convexity.
Finally, we extend this analytical framework to tree structures and propose a general principle for tree construction aimed at maximizing speedup.
Applying this principle to state-of-the-art tree-based speculative sampling methods consistently delivers significant gains, improving overall performance by 4% to 14% and achieving end-to-end speedups of 1.97× to 2.68×.
The implementation is publicly available at:
https://anonymous.4open.science/r/GTCP-CC76/README.md.
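To make the draft-and-verify mechanism summarized above concrete, the following is a minimal sketch of the standard speculative sampling acceptance rule (a drafted token is kept with probability min(1, p/q), where p and q are the target and draft model probabilities, and the first rejection truncates the draft). The function name and inputs are illustrative, not the paper's implementation, which handles tree-structured candidates rather than a single sequence.

```python
import random

def speculative_accept(draft_tokens, q_probs, p_probs, rng):
    """Verify drafted tokens left to right with the standard
    speculative sampling rule: token t is accepted with probability
    min(1, p(t) / q(t)); the first rejection stops verification.
    Returns the accepted prefix (illustrative sketch only)."""
    accepted = []
    for tok, q, p in zip(draft_tokens, q_probs, p_probs):
        if rng.random() < min(1.0, p / q):
            accepted.append(tok)
        else:
            break
    return accepted

# Example: when the target assigns probability >= the draft's,
# every token is accepted; when it assigns 0, the token is rejected.
rng = random.Random(0)
print(speculative_accept(["a", "b"], [0.5, 0.5], [1.0, 1.0], rng))  # ['a', 'b']
print(speculative_accept(["a", "b"], [0.5, 0.5], [0.0, 1.0], rng))  # []
```

Tree-based methods such as the one proposed here generalize this check so that multiple candidate branches are verified in one forward pass of the target model, which is why the tree's structure directly affects the achieved speedup.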
Primary Area: foundation or frontier models, including LLMs
Submission Number: 2959