Principled and Tractable RL for Reasoning with Diffusion Language Models

Anthony Zhan

20 Sept 2025 (modified: 11 Feb 2026). Submitted to ICLR 2026. Readers: Everyone. License: CC BY 4.0

Keywords: diffusion, dllm, llm, grpo, rl, reasoning, policy gradient, post-training, non-autoregressive

Abstract: Diffusion large language models (dLLMs) are a new paradigm of non-autoregressive language models that are trained to predict multiple tokens in parallel and generate text via iterative unmasking. Recent work has pretrained dLLMs to parity with autoregressive LLMs at the 8B scale, but dLLMs have yet to benefit from modern post-training techniques, such as reinforcement learning (RL), that have proven effective for autoregressive models. Crucially, current algorithms are not directly compatible with diffusion models because these models lack a left-to-right factorization of the sequence likelihood. Moreover, existing attempts at dLLM post-training with RL rely on unprincipled heuristics such as mean-field approximations. In this work, we present Amortized Group Relative Policy Optimization (AGRPO), an on-policy RL algorithm designed specifically for dLLMs. Our key insight is that by casting the denoising process as a multi-step Markov decision process, we can use Monte Carlo sampling to compute an unbiased policy gradient estimate, making AGRPO the first tractable yet faithful adaptation of policy gradient methods for dLLMs. We demonstrate AGRPO's effectiveness on different math/reasoning tasks, achieving up to +10.0% absolute gain on GSM8K, up to 3.8x the performance of the baseline LLaDA model on the Countdown task, and up to 3.4x performance gains over comparable RL methods such as diffu-GRPO. Furthermore, these gains persist across different numbers of sampling steps at inference time, yielding better tradeoffs between compute and performance. Our results establish that online RL algorithms can be extended to diffusion LLMs in principled ways, maintaining both theoretical soundness and practical effectiveness.
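The two ingredients named in the abstract, group-relative advantages and a Monte Carlo estimate over denoising steps, can be illustrated with a minimal sketch. This is not the paper's implementation. It assumes the objective decomposes into a sum of per-step terms, one per denoising step, and that steps are sampled uniformly. The function names `group_relative_advantages` and `mc_sum_estimate` are hypothetical and chosen only for illustration.

```python
import random
import statistics


def group_relative_advantages(rewards):
    """GRPO-style advantages: standardize rewards within one group of samples.

    Assumption: plain mean/std normalization; the paper may use a variant.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    if std == 0.0:
        # All completions scored the same, so there is no learning signal.
        return [0.0 for _ in rewards]
    return [(r - mean) / std for r in rewards]


def mc_sum_estimate(per_step_terms, num_samples, rng):
    """Unbiased Monte Carlo estimate of sum(per_step_terms).

    Step indices are drawn uniformly with replacement and the sample mean
    is rescaled by the number of steps T, so the expectation equals the
    full sum while only num_samples terms are evaluated.
    """
    T = len(per_step_terms)
    picks = [rng.randrange(T) for _ in range(num_samples)]
    return T * sum(per_step_terms[t] for t in picks) / num_samples


if __name__ == "__main__":
    rng = random.Random(0)
    print(group_relative_advantages([1.0, 0.0, 1.0, 0.0]))
    print(mc_sum_estimate([1.0, 2.0, 3.0, 4.0], num_samples=20000, rng=rng))
```

In a real training loop the per-step terms would be advantage-weighted log-probabilities of the tokens unmasked at each step, which is why evaluating only a sampled subset of steps reduces cost without biasing the gradient estimate under the stated assumptions.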

Primary Area: foundation or frontier models, including LLMs

Submission Number: 23639
