Beyond Next-Token Prediction: Diffusion vs. Autoregressive Reasoning in LLMs

18 Sept 2025 (modified: 05 Jan 2026) · ICLR 2026 Conference Withdrawn Submission · CC BY 4.0
Keywords: Diffusion LLMs, Autoregressive LLMs, Reasoning, Robustness, Test-Time Scaling
Abstract: We revisit LLM reasoning through two competing decoding paradigms: autoregressive large language models (AR LLMs), which use next-token prediction, and diffusion-based large language models (DLLMs), which use iterative denoising. The community currently lacks compute-controlled, apples-to-apples comparisons between the two. We recast reasoning as trajectory formation, contrasting the sequential commitment of AR LLMs with the iterative refinement of DLLMs, and run a matched-scale study across mathematics, logic, natural-language inference, and commonsense QA, with robustness and efficiency analyses. Empirically, DLLMs surpass AR LLMs on most reasoning benchmarks, particularly when global constraints and long-range coherence matter, while AR LLMs remain competitive on short, commonsense-style problems. Mechanistic analyses show that DLLMs repair early errors and enforce sequence-wide consistency; robustness experiments reveal graceful degradation under prompt noise and distribution shift. We quantify accuracy–efficiency trade-offs: DLLMs increase FLOPs and wall-clock latency, but compute-matched comparisons preserve their advantage, indicating that the benefits arise from the generative mechanism rather than from added budget. Ablations surface practical settings, including low-confidence remasking, moderate guidance, and moderate schedules, that retain accuracy while curbing latency. The results translate into actionable guidance on when to allocate compute to DLLMs and when to favor AR LLMs.
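The abstract's central contrast, sequential commitment versus iterative refinement, hinges on the remasking step it mentions in the ablations: a DLLM can send low-confidence positions back to the mask token and revise them on a later denoising pass, whereas an AR decoder commits to each token permanently. The helper below is a minimal hypothetical sketch of that low-confidence remasking idea, not the paper's actual implementation; the function name, the use of None as a stand-in for the [MASK] token, and the remask fraction are all illustrative assumptions.

```python
def remask_low_confidence(tokens, confidences, remask_frac):
    """Send the least-confident positions of a draft back to [MASK].

    tokens      : list of token ids in the current draft
    confidences : per-position model confidence in [0, 1]
    remask_frac : fraction of positions to remask this refinement step

    Returns a new draft in which the lowest-confidence positions are
    replaced by None (standing in for the [MASK] token), so a later
    denoising pass can re-predict them. At least one position is always
    remasked, so refinement never stalls.
    """
    n_remask = max(1, int(len(tokens) * remask_frac))
    # Rank positions by confidence, lowest first.
    order = sorted(range(len(tokens)), key=lambda i: confidences[i])
    draft = list(tokens)  # copy; the caller's draft is left untouched
    for i in order[:n_remask]:
        draft[i] = None
    return draft
```

For example, with confidences [0.9, 0.1, 0.8, 0.2] and a remask fraction of 0.5, the two least confident positions (indices 1 and 3) are masked, while the confident tokens survive into the next denoising step. An AR decoder has no analogue of this operation, which is one intuition for why the abstract attributes the DLLM advantage to the generative mechanism itself.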
Primary Area: foundation or frontier models, including LLMs
Submission Number: 12206