Keywords: Speculative Decoding, LLM
TL;DR: By invoking the draft model only some of the time, draft tokens can be accepted more often. This can provide an overall gain in TPS.
Abstract: Speculative decoding (SD) is an approach for increasing the Tokens Per Second (TPS) of a base LLM by using a smaller draft model to predict subsequent tokens. These draft tokens can be generated quickly, and their verification by the base model can occur in parallel with generating the next token. A key determinant of the impact of SD on TPS is the _acceptance rate_: how likely a draft token is to be accepted upon verification.
This work explores *Randomised Drafting*, wherein a draft is only generated with some probability $a \leq 1$. By introducing this random component, we show that the acceptance rate can be boosted while preserving the distributional guarantees of SD. Despite sometimes using the base model directly, we show that Randomised Drafting can result in an overall boost in TPS. The improvement in TPS is modest but comes at no cost.
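As a rough illustration of the idea (a sketch, not the authors' implementation), a single decoding step under Randomised Drafting might look like the following. It combines the standard speculative-sampling accept/reject rule (accept a draft token $x$ with probability $\min(1, p(x)/q(x))$, else resample from the normalised residual $\max(p - q, 0)$) with a coin flip of probability $a$ that decides whether to draft at all. The function names and the representation of distributions as dicts are hypothetical:

```python
import random

def sample(dist, rng):
    """Sample a token from a dict mapping token -> probability."""
    r, acc = rng.random(), 0.0
    for t, prob in dist.items():
        acc += prob
        if r < acc:
            return t
    return t  # guard against floating-point rounding

def randomized_draft_step(p, q, a, rng):
    """One decoding step under (hypothetical) Randomised Drafting.

    p: base-model distribution, q: draft-model distribution,
    a: probability of invoking the draft model at all.
    Returns (token, drafted_and_accepted).
    """
    if rng.random() >= a:
        # Skip drafting this step: sample directly from the base model.
        return sample(p, rng), False
    # Draft from q, then accept with probability min(1, p(x)/q(x)).
    x = sample(q, rng)
    if rng.random() < min(1.0, p[x] / q[x]):
        return x, True
    # Rejected: resample from the residual max(p - q, 0), renormalised,
    # which keeps the overall output distributed exactly as p.
    residual = {t: max(p[t] - q[t], 0.0) for t in p}
    z = sum(residual.values())
    return sample({t: v / z for t, v in residual.items()}, rng), False
```

Both branches leave the marginal output distribution equal to $p$, which is why mixing them with any probability $a$ preserves the distributional guarantee of SD.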
Submission Number: 47