Keywords: diffusion language models, parallel decoding
TL;DR: We diagnose that Diffusion Language Models (DLMs) struggle to achieve true parallel decoding because they are trained on inherently sequential data, causing them to revert to autoregressive (left-to-right) behavior.
Abstract: Diffusion Language Models (DLMs) are often advertised as enabling parallel token generation, yet practical ``fast'' DLMs frequently converge to left-to-right, autoregressive (AR)-like decoding dynamics. Genuinely non-AR generation, by contrast, is promising because it removes AR's sequential bottleneck, better exploiting parallel hardware to reduce synchronization and communication overhead and to improve latency scaling with output length. We argue that a primary driver of AR-like decoding is a mismatch between DLM objectives and the highly sequential structure of widely used training data, including standard pretraining corpora and long chain-of-thought (CoT) supervision. Motivated by this diagnosis, we propose \textbf{NAP-D} (Non-Autoregressive Parallel DLMs), a proof-of-concept, data-centric approach that better aligns supervision with parallel decoding. NAP-D curates training examples as sets of multiple independent reasoning trajectories and pairs them with a parallel-forced decoding strategy that encourages multi-token parallel updates. Across math reasoning benchmarks, NAP-D outperforms DLMs trained on standard long-CoT data under parallel decoding, with gains growing as parallelism increases. Our results suggest that revisiting data and supervision is a principled direction for mitigating AR-like behavior and moving toward genuinely non-autoregressive parallel generation in DLMs.
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 127