Keywords: LLMs, LLM inference, parallel decoding
Abstract: Multi-token generation has emerged as a promising paradigm for accelerating transformer-based large language model inference. Recent efforts have primarily explored diffusion-based LLMs (dLLMs) for parallel decoding, reducing latency while preserving generation quality. However, non-diffusion approaches remain largely underexplored, and it remains an open question whether autoregressive (AR) models can be adapted into parallel decoders that are faster than dLLMs while maintaining generation quality. We present pcLLM, a progressive consistency distillation paradigm that transforms AR models into efficient parallel decoders while preserving the causal inference property. pcLLM achieves a $3.6\times$ wall-clock speedup on coding benchmarks with minimal loss in performance. Building on pcLLM's trajectory characteristics, we introduce multi-block decoding with rejection recycling, which yields up to a $4.2\times$ higher token acceptance count per iteration and nearly $4\times$ speedup, effectively trading additional compute for lower inference latency.
Primary Area: foundation or frontier models, including LLMs
Submission Number: 14470