Self-Speculative Decoding Accelerates Lossless Inference in Any-Order and Any-Subset Autoregressive Models
Keywords: speculative decoding, any-order autoregressive models, diffusion language models
TL;DR: We present an algorithm that provably accelerates inference in any-order autoregressive models, without loss of quality.
Abstract: In arbitrary-order language models, it is an open question how to sample tokens
in parallel from the correct joint distribution. For discrete diffusion models, the
more tokens generated in parallel, the further the predicted distributions stray
from the originally learned data distribution, because these models rely on a conditional
independence assumption that holds only in the limit of infinitesimally small timesteps. We find
that a different class of models, any-subset autoregressive models (AS-ARMs),
holds the solution. As implied by the name, AS-ARMs can generate tokens in any
order, and in parallel. Moreover, AS-ARMs support parallelized joint probability
density estimation, which lets them correct their own parallel-generated token
distributions via our Any-Subset Speculative Decoding (ASSD) algorithm. ASSD
provably generates tokens from the correct joint distribution, with the
number of neural network calls upper bounded by the number of tokens predicted;
notably, previous speculative decoding algorithms lack this efficiency guarantee.
We empirically verify that ASSD speeds up language generation, without
sacrificing quality. Furthermore, we provide a mathematically justified scheme for
training AS-ARMs for generation, and show that AS-ARMs achieve state-of-the-art
performance among sub-200M parameter models on infilling benchmark tasks,
and nearly match the performance of models 50X larger on code generation. Our
theoretical and empirical results indicate that the once-forgotten AS-ARMs are a
promising direction for language modeling.
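The core mechanism behind ASSD resembles standard speculative decoding: draft several tokens in parallel from per-position conditionals, then verify them against the model's jointly evaluated density and accept a prefix that preserves the target distribution. Below is a minimal, illustrative sketch of such a draft-and-verify loop; the function names, the toy distributions, and the acceptance rule shown here follow generic speculative sampling and are assumptions for illustration, not the paper's actual implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
VOCAB = 8  # toy vocabulary size


def draft_conditionals(context, positions):
    """Toy stand-in for per-position draft conditionals q(x_i | context).

    In an any-subset autoregressive model these would all come from a single
    parallel forward pass; here they are fabricated deterministically from the
    context so the example is self-contained.
    """
    probs = []
    for pos in positions:
        logits = np.cos(np.arange(VOCAB) * (1 + len(context)) + pos)
        p = np.exp(logits)
        probs.append(p / p.sum())
    return probs


def target_conditional(context, pos):
    """Toy stand-in for the target conditional p(x_pos | context).

    A real verification step would evaluate the model's joint density over the
    whole drafted block in one parallel call; this toy version just perturbs
    the draft distribution so accept/reject actually triggers.
    """
    logits = (np.cos(np.arange(VOCAB) * (1 + len(context)) + pos)
              + 0.3 * np.sin(np.arange(VOCAB)))
    p = np.exp(logits)
    return p / p.sum()


def speculative_step(context, positions):
    """Draft all positions in parallel, then verify them left-to-right.

    Accepted tokens are distributed according to the target conditionals
    (the standard speculative-sampling argument); the first rejected position
    is resampled from the residual distribution max(p - q, 0), and drafting
    stops there.
    """
    q_list = draft_conditionals(context, positions)
    drafts = [int(rng.choice(VOCAB, p=q)) for q in q_list]

    accepted = []
    for pos, tok, q in zip(positions, drafts, q_list):
        p = target_conditional(context + accepted, pos)
        if rng.random() < min(1.0, p[tok] / q[tok]):
            accepted.append(tok)  # draft token accepted as-is
        else:
            residual = np.maximum(p - q, 0.0)
            residual /= residual.sum()
            accepted.append(int(rng.choice(VOCAB, p=residual)))  # corrected token
            break  # stop the block after the first rejection
    return accepted


if __name__ == "__main__":
    print(speculative_step(context=[3, 1], positions=[2, 4, 5]))
```

In this sketch the draft distributions are conditionally independent given the context, mirroring parallel generation, while verification conditions each position on the tokens accepted so far; every call to the verifier yields at least one committed token, which is the intuition behind bounding the number of model calls by the number of tokens produced.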
Primary Area: generative models
Submission Number: 13411