Keywords: Large Language Models, Discrete Diffusion, Blockwise Parallel Decoding, Inference Acceleration, Test-Time Scaling
Abstract: Autoregressive (AR) language modeling remains the dominant paradigm due to its dense supervision signal and highly optimized serving infrastructure, but its strictly causal, token-by-token decoding precludes both parallel generation and non-causal modeling.
While masked diffusion offers a promising path toward parallel generation, it faces two critical bottlenecks: training inefficiency stemming from sparse masked objectives, and high latency caused by iterative whole-sequence denoising.
We present a systematic study of blockwise discrete diffusion, a pragmatic middle ground that preserves AR-compatible serving while enabling parallel intra-block generation.
Our study proceeds in four steps:
(i) a \textbf{controlled, compute- and scale-matched comparison} revealing that AR is a more effective backbone for blockwise hybrids than masked diffusion objectives;
(ii) a \textbf{scalable conversion recipe, \textsc{SDAR}}, validating that AR models spanning 1.7B to 30B parameters can be adapted into block diffusion models with minimal compute while preserving backbone capabilities;
(iii) a \textbf{systematic characterization of decoding dynamics}, which reveals a virtuous cycle in which larger models enable more aggressive parallel decoding, achieving theoretical speedups over 5$\times$ and wall-clock speedups of 2.3$\times$ on H200 GPUs in latency-critical regimes; and
(iv) an \textbf{investigation of local non-causal modeling capabilities}, showing that SDAR's local bidirectional attention overcomes causal bottlenecks in scientific domains (e.g., chemistry) and enables robust test-time scaling.
We release the full model suite, the training framework, and our inference engines to support further innovation in non-autoregressive generative paradigms.
Paper Type: Long
Research Area: LLM Efficiency
Research Area Keywords: LLM Efficiency, inference methods, efficient models, model architectures, scaling, generative models, pre-training
Contribution Types: NLP engineering experiment, Approaches to low-resource settings, Publicly available software and/or pre-trained models
Languages Studied: English
Submission Number: 2262