Keywords: Parallel Decoding; Inference Acceleration; Optimal Transport
TL;DR: We reframe autoregressive decoding as a predictable state transition in probability space. Using Optimal Transport, we align hidden states across steps, enabling parallel decoding with up to 5.23× speedups and minimal accuracy loss.
Abstract: Autoregressive decoding is a primary bottleneck for large language models (LLMs), as its inherent sequentiality severely limits inference speed. While speculative decoding methods mitigate this via a draft-and-verification pipeline, their effectiveness is constrained by dependence on the quality and availability of a draft model. We rethink the generation pattern and introduce a novel theoretical perspective, reframing token generation as a predictable state transition process in probability space, formalized through Optimal Transport (OT) theory. We demonstrate that the temporal consistency of hidden states induces a stable transport map, enabling theoretically grounded multi-step prediction. Building on this insight, we develop SHAPE, an OT-based predictor that implements lightweight Sinkhorn iterations. Extensive evaluations across diverse models (e.g., Qwen, Vicuna, LLaMA, DeepSeek) and tasks (text, code, math) show that SHAPE achieves up to a 5.23× speedup with minimal quality loss ($\leq 1.2\%$ accuracy drop), empirically validating our distributional transition hypothesis. This work establishes a new theoretical foundation for understanding autoregressive decoding and a practical path toward high-speed generation beyond token-wise limitations.
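The abstract refers to lightweight Sinkhorn iterations for computing an entropy-regularized optimal transport plan. The sketch below is a minimal, generic illustration of that primitive, not the paper's SHAPE implementation: the function name, the L2 cost between toy "hidden states", and all hyperparameters are assumptions chosen for clarity.

```python
# Minimal sketch of entropy-regularized OT via Sinkhorn iterations.
# Illustrative only; variable names, cost choice, and marginals are assumptions,
# not the SHAPE predictor described in the paper.
import numpy as np

def sinkhorn(cost, mu, nu, eps=0.05, n_iters=50):
    """Approximate the OT coupling between marginals mu and nu for a cost matrix.

    cost: (n, m) pairwise cost matrix (e.g., distances between hidden states)
    mu:   (n,) source distribution, sums to 1
    nu:   (m,) target distribution, sums to 1
    eps:  entropic regularization strength
    """
    K = np.exp(-cost / eps)             # Gibbs kernel
    u = np.ones_like(mu)
    for _ in range(n_iters):            # alternating scaling updates
        v = nu / (K.T @ u)
        u = mu / (K @ v)
    return u[:, None] * K * v[None, :]  # transport plan P = diag(u) K diag(v)

# Toy usage: couple two small sets of vectors standing in for hidden states.
rng = np.random.default_rng(0)
x, y = rng.normal(size=(4, 8)), rng.normal(size=(5, 8))
C = np.linalg.norm(x[:, None, :] - y[None, :, :], axis=-1)  # pairwise L2 costs
P = sinkhorn(C, np.full(4, 0.25), np.full(5, 0.2))
print(P.sum())  # ≈ 1: P is a valid coupling of the two marginals
```

Each iteration is two matrix-vector products, which is why Sinkhorn-style updates are often described as lightweight relative to exact OT solvers.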
Primary Area: foundation or frontier models, including LLMs
Submission Number: 18703