Keywords: large language models, machine learning efficiency, speculative decoding
Abstract: Speculative decoding is an effective method for lossless acceleration of large language models during inference. It uses a smaller model to draft a block of tokens, which are then verified in parallel by the large model, and it guarantees that the output is distributed identically to a sample from the large model. In prior work, draft verification is performed token by token, with each token verified independently. Surprisingly, we show that this approach is not optimal. We propose *block verification*, a simple, easy-to-implement draft verification algorithm that provides additional wall-clock speedup by verifying the entire block jointly. We prove that the proposed mechanism is optimal in terms of the expected number of tokens produced per iteration and, in particular, is never worse than standard token-level verification.
Empirically, block verification provides modest but consistent wall-clock speedups of 5\%-8\% over the standard token-level verification algorithm across a wide range of tasks and datasets.
Given that block verification does not increase code complexity, preserves the strong lossless guarantee of the standard speculative decoding verification algorithm, and can only match or improve performance (and in practice consistently improves it), it can serve as a good default for speculative decoding implementations.
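As a point of reference, the standard token-by-token verification baseline that the abstract contrasts with can be sketched as follows. This is a minimal NumPy illustration of the well-known rejection-sampling acceptance rule (accept a drafted token with probability min(1, p/q), resample from the residual on rejection); the function name and array layout are hypothetical, and this is not the paper's block-verification algorithm or its implementation.

```python
import numpy as np


def token_level_verify(draft_tokens, q_probs, p_probs, rng):
    """Standard token-by-token speculative decoding verification (sketch).

    draft_tokens: token ids sampled from the draft model (length K).
    q_probs[i]:   draft model's distribution at drafted position i.
    p_probs[i]:   large model's distribution at position i (length K+1,
                  so the last entry supplies the bonus-token distribution).
    Returns the accepted prefix plus one corrective or bonus token, so the
    output is distributed exactly as a sample from the large model.
    """
    out = []
    for i, tok in enumerate(draft_tokens):
        p, q = p_probs[i], q_probs[i]
        # Accept the drafted token with probability min(1, p(tok)/q(tok)).
        if rng.random() < min(1.0, p[tok] / q[tok]):
            out.append(tok)
        else:
            # On rejection, resample from the residual distribution
            # max(p - q, 0), renormalized; this preserves losslessness.
            residual = np.maximum(p - q, 0.0)
            residual /= residual.sum()
            out.append(int(rng.choice(len(p), p=residual)))
            return out
    # All K drafts accepted: sample one bonus token from the large
    # model's distribution at the next position.
    out.append(int(rng.choice(len(p_probs[-1]), p=p_probs[-1])))
    return out
```

Each call yields between 1 and K+1 tokens; block verification, as described above, instead accepts or rejects the drafted block jointly to maximize the expected number of tokens produced per iteration.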
Submission Number: 54