TL;DR: Motivated by the finding that arbitrary-order generation may limit dLLMs' reasoning potential by letting them bypass high-uncertainty tokens, we propose JustGRPO, which applies standard autoregressive (AR) RL for better reasoning.
Abstract: Diffusion Large Language Models (dLLMs) break the rigid left-to-right constraint of traditional LLMs, enabling token generation in arbitrary orders. Intuitively, this flexibility implies a solution space that is a strict superset of the fixed autoregressive trajectory, theoretically unlocking superior reasoning potential. However, in this paper, we find that for general reasoning tasks (e.g., mathematics and coding), arbitrary-order generation may in fact limit the reasoning potential of dLLMs.
We observe that dLLMs tend to exploit this order flexibility to bypass high-uncertainty tokens that are crucial for exploration, which can lead to a premature collapse of solution coverage.
This observation motivates a rethink of RL approaches for dLLMs, where considerable complexity, such as handling combinatorial trajectories and intractable likelihoods, is often incurred to preserve this flexibility.
We show that effective reasoning can be elicited by simply forgoing arbitrary order and applying standard Group Relative Policy Optimization (GRPO) instead.
Our approach, JustGRPO, is minimalist yet surprisingly effective (e.g., 89.1% accuracy on GSM8K) while fully retaining the parallel decoding ability of dLLMs. Code: https://github.com/LeapLabTHU/JustGRPO.
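For readers unfamiliar with GRPO, the following is a minimal sketch of the group-relative advantage at its core: several completions are sampled for one prompt, and each reward is normalized against the mean and standard deviation of its own group. It is an illustration under stated assumptions, not the JustGRPO implementation; the function name `group_relative_advantages`, the `eps` value, the binary rewards, and the use of the population standard deviation are all hypothetical choices made here.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each completion's reward against its own sampling group.

    Hypothetical helper for illustration only; `eps` guards against division
    by zero when every completion in the group receives the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # assumption: population standard deviation
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four completions sampled for one prompt; 1.0 marks a correct final answer.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

Correct completions receive a positive advantage and incorrect ones a negative advantage, and a group in which all rewards are equal contributes no learning signal.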
Lay Summary: A new kind of AI text generator, called a diffusion language model, can write words in any order rather than strictly left to right. Many researchers assumed this freedom would help the model reason better, since it can fill in whichever word it feels most sure about first.
We found the opposite. When solving math or coding problems, the model uses this freedom to dodge the hardest decisions, like the word "Therefore" that decides which way an argument turns. It answers the easy parts first, and by the time it returns to the hard word, the answer is already locked in. This quietly shrinks the range of solutions the model can discover.
So we simply removed the freedom and trained the model the old-fashioned way, left to right. This simpler recipe reasons better while keeping the model's fast, parallel writing speed intact.
Primary Area: Deep Learning->Large Language Models
Keywords: Diffusion Language Models, Reasoning, Reinforcement Learning, Large Language Models
Link To Code: https://github.com/LeapLabTHU/JustGRPO
Submission Number: 2358