TL;DR: We propose a framework to fine-tune a diffusion LLM through distribution matching policy optimization.
Abstract: Diffusion large language models (dLLMs) are promising alternatives to autoregressive large language models (AR-LLMs), as they potentially allow higher inference throughput. Reinforcement learning (RL) is crucial to enabling dLLMs to achieve performance comparable to that of AR-LLMs on important tasks, such as reasoning. However, RL algorithms well-suited to dLLMs' unique characteristics have yet to be developed. This paper proposes Distribution Matching Policy Optimization (DMPO), a principled and theoretically grounded RL fine-tuning method specifically designed to enhance the reasoning capabilities of dLLMs by matching the dLLM policy distribution to the optimal, reward-tilted one through cross-entropy optimization. We identify a key implementation challenge with small training batch sizes and propose several effective solutions based on a novel weight baseline subtraction technique. DMPO exhibits superior performance on multiple reasoning benchmarks without supervised fine-tuning, achieving up to a $39.63$ percentage-point improvement in accuracy over prior non-DMPO RL baselines and $67.97$ percentage points over the base model, underscoring the effectiveness of the distribution-matching framework. Our code is available at https://github.com/yuchen-zhu-zyc/DMPO.
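Illustrative sketch (not taken from the paper or its repository): the abstract describes matching the policy to a reward-tilted target through cross-entropy optimization, and a weight baseline subtraction for small batches. The toy Python below shows one way such an objective could look. It assumes self-normalized weights proportional to exp(r / alpha) over a small group of sampled answers and a mean-weight baseline; the function names, the temperature `alpha`, and the example numbers are all hypothetical, and the weights and baseline actually used by DMPO may differ.

```python
import math


def tilted_weights(rewards, alpha=1.0):
    """Self-normalized weights proportional to exp(r / alpha) (assumed form)."""
    m = max(rewards)  # subtract the max for numerical stability
    exps = [math.exp((r - m) / alpha) for r in rewards]
    total = sum(exps)
    return [e / total for e in exps]


def subtract_baseline(weights):
    """Subtract the mean weight (an assumed baseline choice).

    After centering, below-average samples carry negative weights,
    so the loss pushes their likelihood down instead of up.
    """
    baseline = sum(weights) / len(weights)
    return [w - baseline for w in weights]


def weighted_cross_entropy(log_probs, weights):
    """Loss = -sum_i w_i * log pi(y_i) over the sampled answers."""
    return -sum(w * lp for w, lp in zip(weights, log_probs))


# Hypothetical batch of four sampled answers: rewards and policy log-likelihoods.
rewards = [1.0, 0.0, 0.0, 1.0]
log_probs = [-2.0, -1.5, -3.0, -2.5]

w = tilted_weights(rewards, alpha=0.5)
w_centered = subtract_baseline(w)

loss_plain = weighted_cross_entropy(log_probs, w)
loss_centered = weighted_cross_entropy(log_probs, w_centered)
```

Without the baseline every weight is positive, so minimizing the loss raises the likelihood of low-reward answers as well. With the centered weights, the two zero-reward answers receive negative weights, which is the qualitative behavior the lay summary describes.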
Lay Summary: Many language models answer questions by writing one word after another, which can make them slow. A newer family, called diffusion language models, works more like filling in blanks and revising an answer, which could make them faster. However, these models still need better ways to learn difficult skills such as solving math problems and puzzles.
We developed a training method called Distribution Matching Policy Optimization (DMPO) that teaches these models from their own attempted answers. Instead of rewarding only one best answer, DMPO encourages the model to learn from a range of good answers, so it can explore different useful reasoning paths. We also found and fixed a practical problem: when only a few attempted answers are available during training, the model may accidentally learn from bad answers too. Our correction helps the model strengthen good answers and discourage poor ones.
In tests on math and planning tasks, DMPO made two diffusion language models much better reasoners while keeping their promise of efficient training and faster generation.
Link To Code: https://github.com/yuchen-zhu-zyc/DMPO
Primary Area: Probabilistic Methods
Keywords: Fine-tuning, Diffusion Large Language Model, Policy Optimization
Submission Number: 1251