Keywords: speculative decoding, large language model
Abstract: Speculative decoding (SpD) has emerged as a promising approach to accelerate the slow autoregressive inference of large language models (LLMs).
SpD leverages a lightweight draft model to propose candidate tokens, which are then verified in parallel by the target LLM.
Recent advances in tree-based SpD significantly improve efficiency by drafting token trees, enabling the verification of multiple sequences at once.
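To make the drafting-and-verification mechanics concrete, here is a minimal sketch of greedy tree verification: one parallel forward pass of the target LLM scores every node of the drafted tree, and the accepted sequence is the longest root-to-leaf path whose tokens match the target's greedy choices. The data layout (a children map, per-node tokens, per-node target argmax) is an illustrative assumption, not any particular method's data structure.

```python
# Sketch of greedy verification of a drafted token tree, assuming the
# target model has already scored all tree nodes in one parallel pass.
from typing import Dict, List

def verify_tree(children: Dict[int, List[int]],
                token_of: Dict[int, int],
                target_argmax: Dict[int, int],
                root: int = 0) -> List[int]:
    """Return the longest accepted token sequence.

    children[n]: ids of node n's drafted children.
    token_of[n]: token id drafted at node n.
    target_argmax[n]: target model's greedy next token after the
        prefix ending at node n (from one parallel forward pass).
    """
    accepted, node = [], root
    while True:
        want = target_argmax[node]  # target's next token at this node
        match = next((c for c in children.get(node, [])
                      if token_of[c] == want), None)
        if match is None:
            return accepted         # first mismatch: stop accepting
        accepted.append(want)
        node = match                # descend along the accepted branch
```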
Given the strong empirical performance reported across numerous studies, tree-based SpD is rapidly becoming the dominant approach.
However, existing draft-model training methods overlook the tree structure when defining their training objectives, leaving the training and inference distributions misaligned.
We address this limitation with a tree-aware loss function (TALF) that explicitly incorporates the tree structure into draft model training.
Using trees generated by the target LLM, TALF aligns the draft model's predictions with the target's across all branches, mitigating this misalignment.
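As one concrete reading of a tree-aware objective, the sketch below averages a per-node KL divergence between the target's and the draft's next-token distributions over every node of a flattened, target-generated token tree. The flattened layout, the padding mask, and the choice of KL(target || draft) are assumptions for illustration, not the paper's exact TALF formulation.

```python
# Minimal sketch of a tree-aware loss: a per-node KL divergence
# between target and draft next-token distributions, averaged over
# all valid nodes of the flattened token tree (all branches included).
import torch
import torch.nn.functional as F

def tree_aware_loss(draft_logits: torch.Tensor,
                    target_logits: torch.Tensor,
                    node_mask: torch.Tensor) -> torch.Tensor:
    """KL(target || draft) averaged over valid tree nodes.

    draft_logits, target_logits: (num_nodes, vocab) logits for each
        node of the flattened tree.
    node_mask: (num_nodes,) bool, True for real nodes, False for padding.
    """
    log_p_draft = F.log_softmax(draft_logits, dim=-1)
    p_target = F.softmax(target_logits, dim=-1)
    # Per-node KL, summed over the vocabulary.
    kl = (p_target * (p_target.clamp_min(1e-9).log() - log_p_draft)).sum(-1)
    return (kl * node_mask).sum() / node_mask.sum().clamp_min(1)
```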
Further, we improve the tree construction process during drafting with stopping at low further gains (SALF). As drafting iterations search for high-probability tokens to add to the tree, we estimate the aggregate probability gain of continuing to expand; this estimate serves as the stopping criterion for drafting, balancing computational cost against draft quality for maximum performance.
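The following sketch illustrates a gain-based stopping rule in the spirit of SALF: before each drafting iteration, estimate the aggregate probability mass the next tree expansion could add, and stop once that estimate falls below a threshold. The specific gain estimator and the threshold `tau` are hypothetical; the paper's exact criterion may differ.

```python
# Illustrative gain-based stopping rule for tree drafting: stop when
# the estimated aggregate probability gain of one more expansion
# round drops below a (hypothetical) threshold tau.
from typing import List

def should_stop(frontier_cum_probs: List[float],
                top_child_probs: List[float],
                tau: float = 0.05) -> bool:
    """frontier_cum_probs[i]: cumulative (path) probability of
    frontier node i; top_child_probs[i]: draft probability of its
    best candidate child."""
    estimated_gain = sum(c * p for c, p in
                         zip(frontier_cum_probs, top_child_probs))
    return estimated_gain < tau
```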
Together, SALF and TALF deliver 15.6–39.4% and 6.5–24.4% end-to-end speedups over the state-of-the-art SpD methods EAGLE-2 and HASS, without altering the draft model architecture.
Supplementary Material: zip
Primary Area: generative models
Submission Number: 16563