Keywords: Parallel Decoding; Inference Acceleration; Optimal Transport
TL;DR: We reframe autoregressive decoding as a predictable state transition in probability space. Using Optimal Transport, we align hidden states across steps, enabling parallel decoding with up to 5.23× speedups and minimal accuracy loss.
Abstract: Autoregressive decoding is a primary bottleneck for large language models (LLMs), as its inherent sequentiality severely limits inference speed. While speculative decoding methods mitigate this via a draft-and-verification pipeline, their effectiveness is constrained by dependence on the quality and availability of a draft model. We rethink the generation pattern and introduce a novel theoretical perspective, reframing token generation as a predictable state transition process in probability space, formalized through Optimal Transport (OT) theory. We demonstrate that the temporal consistency of hidden states induces a stable transport map, enabling theoretically grounded multi-step prediction. Building on this insight, we develop SHAPE, an OT-based predictor that implements lightweight Sinkhorn iterations. Extensive evaluations across diverse models (e.g., Qwen, Vicuna, LLaMA, DeepSeek) and tasks (text, code, math) show that SHAPE achieves up to a 5.23× speedup with minimal quality loss ($\leq 1.2\%$ accuracy drop), empirically validating our distributional transition hypothesis. This work establishes a new theoretical foundation for understanding autoregressive decoding and a practical path toward high-speed generation beyond token-wise limitations.
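The abstract refers to lightweight Sinkhorn iterations for computing an entropy-regularized optimal transport plan. The sketch below is a minimal, generic illustration of that primitive, not the paper's SHAPE implementation: the function name, the L2 cost between toy "hidden states", and all hyperparameters are assumptions chosen for clarity.

```python
# Minimal sketch of entropy-regularized OT via Sinkhorn iterations.
# Illustrative only; variable names, cost choice, and marginals are assumptions,
# not the SHAPE predictor described in the paper.
import numpy as np

def sinkhorn(cost, mu, nu, eps=0.05, n_iters=50):
    """Approximate the OT coupling between marginals mu and nu for a cost matrix.

    cost: (n, m) pairwise cost matrix (e.g., distances between hidden states)
    mu:   (n,) source distribution, sums to 1
    nu:   (m,) target distribution, sums to 1
    eps:  entropic regularization strength
    """
    K = np.exp(-cost / eps)             # Gibbs kernel
    u = np.ones_like(mu)
    for _ in range(n_iters):            # alternating scaling updates
        v = nu / (K.T @ u)
        u = mu / (K @ v)
    return u[:, None] * K * v[None, :]  # transport plan P = diag(u) K diag(v)

# Toy usage: couple two small sets of vectors standing in for hidden states.
rng = np.random.default_rng(0)
x, y = rng.normal(size=(4, 8)), rng.normal(size=(5, 8))
C = np.linalg.norm(x[:, None, :] - y[None, :, :], axis=-1)  # pairwise L2 costs
P = sinkhorn(C, np.full(4, 0.25), np.full(5, 0.2))
print(P.sum())  # ≈ 1: P is a valid coupling of the two marginals
```

Each iteration is two matrix-vector products, which is why Sinkhorn-style updates are often described as lightweight relative to exact OT solvers.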
Primary Area: foundation or frontier models, including LLMs
Submission Number: 18703