Keywords: diffusion, dllm, llm, grpo, rl, reasoning, policy gradient, post-training, non-autoregressive
Abstract: Diffusion large language models (dLLMs) are a new paradigm of non-autoregressive language models that are trained to predict multiple tokens in parallel and generate text via iterative unmasking. Recent works have successfully pretrained dLLMs to parity with autoregressive LLMs at the 8B scale, but dLLMs have yet to benefit from modern post-training techniques, such as reinforcement learning (RL), that have proven effective for autoregressive models. Crucially, current RL algorithms are not directly compatible with diffusion models, because dLLMs lack a left-to-right factorization of the sequence likelihood. Moreover, existing attempts at dLLM post-training with RL rely on unprincipled heuristics such as mean-field approximations. In this work, we present Amortized Group Relative Policy Optimization (AGRPO), an on-policy RL algorithm designed specifically for dLLMs. Our key insight is that by casting the denoising process as a multi-step Markov decision process, we can use Monte Carlo sampling to compute an unbiased policy gradient estimate, making AGRPO the first tractable yet faithful adaptation of policy gradient methods for dLLMs. We demonstrate AGRPO's effectiveness on several math and reasoning tasks, achieving up to +10.0% absolute gain on GSM8K and 3.8x performance on the Countdown task over the baseline LLaDA model, and 3.4x performance gains over comparable RL methods such as diffu-GRPO. Furthermore, these gains persist across different numbers of sampling steps at inference time, yielding better tradeoffs between compute and performance. Our results establish that online RL algorithms can be extended to diffusion LLMs in principled ways, maintaining both theoretical soundness and practical effectiveness.
Primary Area: foundation or frontier models, including LLMs
Submission Number: 23639
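The abstract's key insight (treat denoising as a multi-step decision process and estimate the policy gradient by Monte Carlo sampling) can be illustrated with a toy sketch. This is not the authors' AGRPO implementation, whose details the abstract does not give. The function names, the standard-deviation normalization of advantages, and the uniform sampling of step indices are all assumptions made for illustration; `per_step_terms` is a hypothetical stand-in for per-denoising-step contributions to the objective.

```python
import random
import statistics


def group_relative_advantages(rewards):
    """GRPO-style advantages (assumed form): center rewards on the group
    mean and divide by the group standard deviation."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    if std == 0.0:
        # All completions scored the same, so there is no learning signal.
        return [0.0 for _ in rewards]
    return [(r - mean) / std for r in rewards]


def mc_sum_over_steps(per_step_terms, num_samples, rng):
    """Unbiased Monte Carlo estimate of sum(per_step_terms).

    Step indices are drawn uniformly with replacement; rescaling the sample
    mean by the number of steps T makes the expectation equal the full sum,
    so only num_samples steps need to be evaluated instead of all T.
    """
    T = len(per_step_terms)
    sampled = [per_step_terms[rng.randrange(T)] for _ in range(num_samples)]
    return T * sum(sampled) / num_samples


if __name__ == "__main__":
    rng = random.Random(0)
    # Hypothetical rewards for a group of four sampled completions.
    print(group_relative_advantages([1.0, 0.0, 1.0, 0.0]))
    # Hypothetical per-step terms for a four-step denoising trajectory.
    terms = [1.0, 2.0, 3.0, 4.0]
    estimates = [mc_sum_over_steps(terms, 2, rng) for _ in range(20000)]
    print(statistics.fmean(estimates), "vs exact", sum(terms))
```

Averaged over many draws, the sampled estimate approaches the exact sum over steps, which is the sense in which such an estimator is unbiased. A real training loop would replace the toy numbers with model log-probabilities weighted by advantages.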