Keywords: Token Tree, Speculative Decoding, Inference
TL;DR: We propose a classifier-based speculative decoding token tree construction method that significantly improves token tree accuracy, as validated across multiple models and benchmarks.
Abstract: With the increasing scale of Large Language Models (LLMs), inference latency and computational cost have become increasingly prominent concerns. Speculative decoding methods have emerged to alleviate these challenges, but existing tree construction strategies are inefficient at accurately preparing candidate token trees for the verification stage. To address this, we propose a plug-and-play method named C2T that leverages a lightweight three-feature classifier with only 241 parameters to dynamically generate and pre-prune token trees, and that additionally enables early stopping during token sequence inference. Our approach outperforms traditional probability-based dynamic token tree construction methods while introducing negligible computational overhead. We evaluated our method on multiple benchmarks and models and showed that, when combined with SOTA methods such as EAGLE-2/3, it can reduce the number of candidate tokens by 25% without sacrificing acceptance length, resulting in a 7% to 17% speedup across models of different sizes.
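The abstract only fixes the classifier's size (three input features, 241 parameters) and its role (scoring candidate nodes so the token tree can be pre-pruned); the exact architecture and feature set are not given here. The following is a minimal NumPy sketch under the assumption of a 3-to-48-to-1 MLP, which happens to have exactly 3*48 + 48 + 48 + 1 = 241 parameters; the feature names and the threshold-based pruning rule are likewise illustrative assumptions, not the paper's specification.

```python
import numpy as np

# Hypothetical classifier: a 3 -> 48 -> 1 MLP. This specific architecture is
# an assumption chosen because it totals exactly 241 parameters, matching the
# abstract; the paper's actual classifier may differ.
rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(3, 48)), np.zeros(48)   # 144 + 48 params
W2, b2 = rng.normal(size=(48, 1)), np.zeros(1)    #  48 +  1 params

def num_params() -> int:
    return W1.size + b1.size + W2.size + b2.size  # 241

def score(features: np.ndarray) -> np.ndarray:
    """Score candidate tree nodes.

    features: (n_nodes, 3) array; e.g. draft-model probability, node depth,
    and parent score -- illustrative feature choices, not the paper's.
    Returns a (n_nodes,) array of acceptance scores in (0, 1).
    """
    h = np.maximum(features @ W1 + b1, 0.0)            # ReLU hidden layer
    return 1.0 / (1.0 + np.exp(-(h @ W2 + b2))).ravel()  # sigmoid output

def prune(features: np.ndarray, threshold: float = 0.5) -> np.ndarray:
    """Keep only candidate nodes whose score clears the threshold."""
    return np.flatnonzero(score(features) >= threshold)
```

Because the classifier is this small, scoring every node of a candidate tree adds negligible overhead relative to a verification forward pass, which is consistent with the abstract's claim of reduced candidate counts at no cost to acceptance length.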
Primary Area: foundation or frontier models, including LLMs
Submission Number: 16509