Parallel Token Generation for Language Models

ICLR 2026 Conference Submission 14536 Authors

18 Sept 2025 (modified: 08 Oct 2025), ICLR 2026 Conference Submission, CC BY 4.0
Keywords: transformer, autoregressive model, multi-token prediction, generative model, large language models
TL;DR: An efficient and universal framework to jointly predict multiple tokens in a single transformer call.
Abstract: Autoregressive transformers are the backbone of modern large language models. Despite their success, inference remains slow due to strictly sequential prediction. Prior attempts to predict multiple tokens per step typically impose independence assumptions across tokens, which limits their ability to match the full expressiveness of standard autoregressive models. In this work, we break this paradigm by proposing an efficient and universal framework to jointly predict multiple tokens in a single transformer call, without limiting the representational power. Inspired by ideas from inverse autoregressive normalizing flows, we convert a series of random variables deterministically into a token sequence, incorporating the sampling procedure into a trained model. This allows us to train parallelized models both from scratch and by distilling an existing autoregressive model. Empirically, our distilled model matches its teacher's output for an average of close to 50 tokens on toy data and 5 tokens on a coding dataset, all within a single forward pass.
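The key observation behind the distillation setup is that, once the noise variables are fixed, autoregressive sampling becomes a deterministic map from (prefix, u_1, ..., u_k) to the next k tokens, so a parallel student can be trained to reproduce that map in a single call. Below is a minimal, hypothetical sketch of this idea (not the authors' code), assuming a toy bigram "teacher" and inverse-CDF sampling; all names such as `teacher_logits` and `teacher_rollout` are illustrative.

```python
# Illustrative sketch: with fixed noise, sequential sampling is a deterministic
# function of (prefix, u_1..u_K), yielding input-output pairs for distilling a
# parallel student f_theta(prefix, u_1..u_K) -> (t_1..t_K). Toy teacher only.
import numpy as np

VOCAB = 8   # toy vocabulary size
K = 4       # number of tokens predicted jointly

rng = np.random.default_rng(0)
W = rng.normal(size=(VOCAB, VOCAB))  # parameters of a toy bigram "teacher"

def teacher_logits(context):
    """Toy autoregressive teacher: next-token logits given the last token."""
    return W[context[-1]]

def inverse_cdf_sample(logits, u):
    """Deterministically map a uniform noise variable u to a token (inverse CDF)."""
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    idx = int(np.searchsorted(np.cumsum(probs), u))
    return min(idx, len(probs) - 1)  # guard against floating-point edge cases

def teacher_rollout(prefix, noise):
    """Sequential sampling: one teacher call per token, driven only by the noise."""
    tokens = list(prefix)
    for u in noise:
        tokens.append(inverse_cdf_sample(teacher_logits(tokens), u))
    return tokens[len(prefix):]

# Distillation data: each (prefix, noise) pair deterministically determines the
# teacher's k tokens, which become the training target for the parallel student.
prefix = [0]
noise = rng.uniform(size=K)  # u_1..u_K, the only source of randomness
targets = teacher_rollout(prefix, noise)
print("noise:", np.round(noise, 3), "-> teacher tokens:", targets)
```

In the paper's framing, the student plays the role of an inverse autoregressive flow: it consumes the same noise variables but emits all k tokens in one transformer call instead of k sequential ones.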
Primary Area: foundation or frontier models, including LLMs
Submission Number: 14536