SpecTr: Fast Speculative Decoding via Optimal Transport

Ziteng Sun; Ananda Theertha Suresh; Jae Hun Ro; Ahmad Beirami; Himanshu Jain; Felix Yu; Michael Riley; Sanjiv Kumar

Published: 20 Jun 2023, Last Modified: 16 Jul 2023, Venue: ES-FoMO 2023 Poster

Keywords: autoregressive sampling; computation efficiency; optimal transport

Abstract: Autoregressive sampling from large language models has been shown to achieve state-of-the-art results in several natural language tasks. However, autoregressive sampling generates tokens one at a time, making it slow and even prohibitive in certain tasks. One way to speed up decoding is *speculative decoding*: use a smaller model to sample a *draft* (a block or sequence of tokens), and then score all tokens in the draft with the desired large language model in parallel. The tokens in the draft are either accepted or rejected by a statistical method that guarantees that the final output is a valid sample from the large model. In this work, we provide a principled understanding of speculative decoding through the lens of optimal transport (OT) with *membership cost*. This framework can be viewed as an extension of the well-known *maximal-coupling* problem. This new formulation enables us to generalize the sampling method to allow for a set of $k$ candidates at the token level, leading to an improved optimal membership cost. The optimal solution can be computed via linear programming, whose best-known runtime is exponential in $k$. We then propose an approximate solution whose acceptance probability is $(1-1/e)$-optimal multiplicatively. Moreover, it can be computed in time almost linear in the size of the token vocabulary. Using this new OT algorithm, we develop a new autoregressive sampling algorithm called *SpecTr*, which creates multiple drafts of the next few tokens from the small language model and scores all of them in parallel with the large language model. We accept one or reject all of them based on their respective scores. We experimentally demonstrate that the proposed approach achieves a speedup of 3X, which is a further 1.36X speedup over speculative decoding, on standard benchmarks.
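
The abstract does not spell out the accept/reject rule, so the following is only a minimal sketch of the single-draft ($k = 1$) case, assuming the standard maximal-coupling rule commonly used in speculative decoding: accept a draft token $x$ drawn from the small model's distribution $q$ with probability $\min(1, p(x)/q(x))$, and otherwise resample from the normalized residual $\max(p - q, 0)$. It is not the SpecTr algorithm itself. The function names and the toy distributions `p` and `q` are hypothetical and chosen for illustration.

```python
import random

def speculative_accept(p, q, rng=random):
    """One token-level step (assumed standard rule, k = 1).

    p: large-model distribution, q: small-model distribution,
    both dicts mapping token -> probability over the same tokens.
    Returns (token, accepted); the token is distributed according to p.
    """
    tokens = list(p)
    # Draw the draft token from the small model.
    draft = rng.choices(tokens, weights=[q[t] for t in tokens])[0]
    # Accept with probability min(1, p(x) / q(x)).
    if rng.random() < min(1.0, p[draft] / q[draft]):
        return draft, True
    # On rejection, resample from the residual max(p - q, 0);
    # rng.choices normalizes the weights internally.
    residual = [max(p[t] - q[t], 0.0) for t in tokens]
    return rng.choices(tokens, weights=residual)[0], False

def acceptance_probability(p, q):
    """Probability that the draft is accepted: sum_x min(p(x), q(x))."""
    return sum(min(p[t], q[t]) for t in p)

# Hypothetical toy distributions over a three-token vocabulary.
p = {"a": 0.5, "b": 0.3, "c": 0.2}
q = {"a": 0.2, "b": 0.6, "c": 0.2}
```

Under this rule the acceptance probability equals one minus the total variation distance between `p` and `q`, which is the maximal-coupling value that the abstract says the membership-cost formulation extends to $k$ candidates.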

Submission Number: 34
