Abstract: As large language models (LLMs) become integral to advancing NLP tasks, their sequential decoding remains a bottleneck to efficient inference. Multi-Draft Speculative Decoding (MDSD) has emerged as a promising solution: a small draft model produces a token tree in which each root-to-leaf path is a draft of the target LLM's output, and the target LLM then verifies all drafts in parallel.
However, current methods rely on Recursive Rejection Sampling (RRS) and its variants, which suffer from low acceptance rates for later drafts, diminishing the benefit of using multiple drafts. In this work, we investigate this inefficiency and sub-optimality through an optimal transport (OT) formulation that maximizes the acceptance rate by optimizing the joint distribution $\pi(x_{1:k},y)$ of the $k$ draft tokens $x_{1:k}$ and the accepted token $y$. We show that the OT problem can be greatly simplified to a much smaller linear program (LP) over only a few entries of $\pi(x_{1:k},y)$. Moreover, our analysis of different choices for the draft marginal distribution $Q(x_{1:k})$ reveals its importance to both sampling effectiveness and efficiency. Motivated by this insight, we introduce SpecHub, which adopts a special design of $Q(x_{1:k})$ that significantly accelerates the LP and provably achieves a higher acceptance rate than existing strategies. SpecHub can be seamlessly integrated into existing MDSD frameworks, improving their acceptance rates while incurring only linear computational overhead. In extensive experiments, SpecHub consistently generates 0.05-0.27 and 0.02-0.16 more tokens per step than RRS with and without replacement, respectively, and achieves equivalent batch efficiency with half the concurrency. Our code is available at \url{anonymous.4open.science/r/SpecHub}.
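For concreteness, the OT problem referenced above can be sketched as follows; here $P(y)$ denotes the target model's next-token distribution (a symbol not spelled out in the abstract), and the objective is the probability that some draft token is accepted:

$$
\max_{\pi}\; \Pr_{(x_{1:k},\,y)\sim\pi}\bigl[\, y \in \{x_1,\dots,x_k\} \,\bigr]
\quad \text{s.t.} \quad
\sum_{y}\pi(x_{1:k},y) = Q(x_{1:k}), \qquad
\sum_{x_{1:k}}\pi(x_{1:k},y) = P(y).
$$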
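To illustrate the baseline the abstract critiques, here is a minimal sketch (our own, not the paper's released code) of with-replacement RRS verification, assuming the $k$ drafts were sampled i.i.d. from a single draft distribution `q`; the function name `rrs_verify` is hypothetical:

```python
import numpy as np

def rrs_verify(p, q, drafts, rng):
    """Verify k draft tokens with Recursive Rejection Sampling (with replacement).

    p      : target model's next-token distribution (1-D array summing to 1)
    q      : draft distribution the k tokens were sampled i.i.d. from
    drafts : the k draft token ids, in order
    Returns the id of the single token that is ultimately emitted.
    """
    p = np.asarray(p, dtype=float).copy()
    for x in drafts:
        # Standard speculative-sampling acceptance test for this draft.
        if rng.random() < min(1.0, p[x] / q[x]):
            return int(x)
        # Rejected: replace p with the normalized residual and retry with the
        # next draft; repeated residuals shrink the acceptance probability of
        # later drafts, which is the inefficiency the paper targets.
        residual = np.maximum(p - q, 0.0)
        p = residual / residual.sum()
    # Every draft was rejected: sample a correction token from the residual.
    return int(rng.choice(len(p), p=p))

# Toy usage on a 4-token vocabulary with k = 2 drafts.
rng = np.random.default_rng(0)
p = np.array([0.5, 0.2, 0.2, 0.1])       # target distribution
q = np.array([0.25, 0.25, 0.25, 0.25])   # draft distribution
drafts = rng.choice(4, size=2, p=q)
print(rrs_verify(p, q, drafts, rng))
```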
Paper Type: Long
Research Area: Special Theme (conference specific)
Research Area Keywords: Efficient/Low-Resource Methods for NLP, Generation, Efficiency in Model Algorithms Training and Inference
Contribution Types: Approaches low compute settings-efficiency, Theory
Languages Studied: English
Submission Number: 4372