Keywords: Speculative Decoding, LLM inference
Abstract: Speculative decoding (SD) has emerged as a promising technique to accelerate LLM inference by employing a small, efficient draft model to propose draft tokens in advance, which are subsequently validated in parallel by the large target model. However, existing SD methods remain fundamentally constrained by their serialized execution, which inevitably causes mutual waiting bubbles between the draft and target models. To address this challenge, we draw inspiration from the branch prediction mechanisms of modern processors and propose a novel framework, \textbf{SpecBranch}, to fully unlock branch parallelism in SD. Specifically, we first conduct an in-depth analysis of the potential of branch parallelism in SD and identify the key challenge: the intricate trade-off between parallelization and token rollback. Based on this analysis, we introduce parallel speculative branches to preemptively hedge against likely rejections. Meanwhile, to further enhance parallelism, we jointly orchestrate adaptive draft lengths using a hybrid of the draft model's implicit confidence and explicit reuse of target model features. Extensive experiments across various models and benchmarks show that \textbf{SpecBranch} achieves speedups of \textbf{1.8}$\times$--\textbf{4.5}$\times$ over standard auto-regressive decoding and reduces rollback tokens by \textbf{50}\% for poorly aligned models, while preserving an identical sampling distribution. Our code is available at \url{https://github.com/Sylvan820/Specbranch}.
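The draft-then-verify loop the abstract describes can be sketched with toy stand-in models. This is a minimal illustration of vanilla speculative decoding and its rollback cost, not the paper's SpecBranch system; the function names and toy token-transition rules are assumptions for illustration only.

```python
def draft_propose(prefix, k):
    # Toy draft model: next token = last token + 1 (mod 10).
    out, last = [], prefix[-1]
    for _ in range(k):
        last = (last + 1) % 10
        out.append(last)
    return out

def target_next(prefix):
    # Toy target model: agrees with the draft except after token 4,
    # where it emits 0 instead, forcing a rejection and rollback.
    last = prefix[-1]
    return 0 if last == 4 else (last + 1) % 10

def speculative_step(prefix, k=4):
    """One draft-and-verify round; returns (new_prefix, rolled_back).

    The target checks all k draft tokens in a single (parallel) pass;
    tokens after the first mismatch are rolled back (discarded).
    """
    drafts = draft_propose(prefix, k)
    cur, accepted = list(prefix), 0
    for t in drafts:
        expect = target_next(cur)      # target's verified next token
        if t == expect:
            cur.append(t)
            accepted += 1
        else:
            cur.append(expect)         # replace first rejected token
            return cur, len(drafts) - accepted
    return cur, 0

seq, rolled_back = speculative_step([1], k=4)
# The draft proposes 2, 3, 4, 5; the target accepts 2, 3, 4 but
# rejects 5 (emitting 0), so one draft token is rolled back.
```

In vanilla SD the draft and target alternate serially, so the target idles while tokens are drafted and vice versa; SpecBranch's parallel speculative branches aim to hedge against the rollback shown above instead of paying for it after the fact.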
Primary Area: generative models
Submission Number: 12802