Keywords: LLMs, LLM inference, parallel decoding
Abstract: Multi-token generation has emerged as a promising paradigm for accelerating transformer-based large language model inference. Recent efforts have primarily explored diffusion-based LLMs (dLLMs) for parallel decoding, reducing latency while preserving generation quality. However, non-diffusion approaches remain largely underexplored, and it remains an open question whether autoregressive (AR) models can be adapted into parallel decoders that are faster than dLLMs while maintaining generation quality. We present pcLLM, a progressive consistency distillation paradigm that transforms AR models into efficient parallel decoders while preserving the causal inference property. pcLLM achieves a $3.6\times$ wall-clock speedup on coding benchmarks with minimal loss in performance. Building on pcLLM's trajectory characteristics, we introduce multi-block decoding with rejection recycling, which yields up to a $4.2\times$ higher token acceptance count per iteration and nearly $4\times$ speedup, effectively trading additional compute for lower inference latency.
Primary Area: foundation or frontier models, including LLMs
Submission Number: 14470