WETAP: Speculative Decoding with Width-Entropy Tree and Adaptive Pruning for LLM Inference Acceleration
Keywords: LLM, Speculative Decoding, Token Tree, Entropy, Pruning
Abstract: In inference acceleration of Large Language Models (LLMs), speculative decoding coordinates a draft model and a target model: candidate sequences are generated by the draft model and then verified in parallel by the target model, so the generation quality and speed of the draft model are the key issues. In this paper, we find that in a token tree, most child tokens are grown from a few parent tokens with large probabilities in low-entropy layers, and that tokens with small probabilities in deeper layers also have the potential to be accepted. Based on these observations, we propose WETAP, which first constructs a token tree by determining the width of each layer from the entropy of the previous layer, and then prunes it by considering both the probability and the depth of each token, retaining the most promising candidates. Experiments show that the proposed WETAP improves generation performance by up to 90% and speed by up to 120% compared to other SOTA methods.
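The two mechanisms described in the abstract (entropy-driven layer width and probability-plus-depth pruning) can be sketched as follows. This is a minimal illustration only: the width mapping, the `w_min`/`w_max` bounds, and the `alpha` depth bonus are all hypothetical parameters, since the abstract does not give the paper's exact formulas.

```python
import math

def layer_width(probs, w_min=2, w_max=8):
    """Choose the next layer's width from the entropy of a parent's
    next-token distribution (hypothetical mapping, not the paper's
    exact function). Low entropy -> mass on few tokens -> narrow
    layer; high entropy -> widen the layer."""
    h = -sum(p * math.log(p) for p in probs if p > 0)  # Shannon entropy
    h_max = math.log(len(probs))  # entropy of the uniform distribution
    frac = h / h_max if h_max > 0 else 0.0
    return max(w_min, min(w_max, round(w_min + frac * (w_max - w_min))))

def prune(candidates, keep, alpha=0.5):
    """Rank tree nodes by cumulative log-probability plus a depth
    bonus (hypothetical score), so deeper low-probability tokens can
    still survive, then keep the top candidates."""
    scored = sorted(candidates,
                    key=lambda c: c["logp"] + alpha * c["depth"],
                    reverse=True)
    return scored[:keep]
```

For instance, a near-deterministic distribution such as `[0.97, 0.01, 0.01, 0.01]` yields a narrow layer, while a uniform `[0.25, 0.25, 0.25, 0.25]` yields the maximum width; `prune` can then let a deep token outrank a shallow one of slightly higher probability.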
Supplementary Material: zip
Primary Area: foundation or frontier models, including LLMs
Submission Number: 15762