Keywords: Large Language Models, Discrete Diffusion, Blockwise Parallel Decoding, Inference Acceleration, Test-Time Scaling
Abstract: Autoregressive (AR) language modeling remains the dominant paradigm due to its dense supervision signal and highly optimized serving infrastructure, but its strictly causal, token-by-token decoding precludes both parallel generation and non-causal modeling.
While masked diffusion offers a promising path toward parallel generation, it faces two critical bottlenecks: training inefficiency stemming from sparse masked objectives, and high latency caused by iterative whole-sequence denoising.
We present a systematic study of blockwise discrete diffusion, a pragmatic middle ground that preserves AR-compatible serving while enabling parallel intra-block generation.
Our study proceeds in four steps:
(i) a \textbf{controlled, compute- and scale-matched comparison} revealing that AR is a more effective backbone for blockwise hybrids than masked diffusion objectives;
(ii) a \textbf{scalable conversion recipe, \textsc{SDAR}}, validating that AR models spanning 1.7B to 30B parameters can be adapted into block diffusion models with minimal compute while preserving backbone capabilities;
(iii) a \textbf{systematic characterization of decoding dynamics}, which reveals a virtuous cycle in which larger models enable more aggressive parallel decoding, achieving theoretical speedups over 5$\times$ and wall-clock speedups of 2.3$\times$ on H200 GPUs in latency-critical regimes; and
(iv) an \textbf{investigation of local non-causal modeling capabilities}, showing that SDAR's local bidirectional attention overcomes causal bottlenecks in scientific domains (e.g., chemistry) and enables robust test-time scaling.
We release the full model suite, the training framework, and our inference engines to support further innovation in non-autoregressive generative paradigms.
Paper Type: Long
Research Area: LLM Efficiency
Research Area Keywords: LLM Efficiency, inference methods, efficient models, model architectures, scaling, generative models, pre-training
Contribution Types: NLP engineering experiment, Approaches to low-resource settings, Publicly available software and/or pre-trained models
Languages Studied: English
Submission Number: 2262