Revolutionizing Reinforcement Learning Framework for Diffusion Large Language Models

ICLR 2026 Conference Submission 2125 Authors

04 Sept 2025 (modified: 18 Nov 2025) · ICLR 2026 Conference Submission · Readers: Everyone · CC BY 4.0
Keywords: Diffusion Language Models, Reinforcement Learning
Abstract: The extension of diffusion models to language tasks has shown promising results, but their post-training methods remain largely unexplored. We highlight the importance of aligning a diffusion language model’s preference-inference trajectory with its post-training objective. To this end, we propose TraceRL, a trajectory-aware reinforcement learning framework for DLMs that incorporates information from inference trajectories into post-training and is applicable to both full-attention and block-attention diffusion models. We also introduce a diffusion-based value model that enhances training stability and naturally accommodates process rewards. We demonstrate TraceRL’s superiority in enhancing a model’s reasoning ability on complex math and coding tasks, as well as its applicability in scaling block diffusion models to larger block sizes. Employing TraceRL, we derive a series of state-of-the-art diffusion language models, namely TraDo. Although smaller than Qwen2.5-7B-Instruct, TraDo-4B-Instruct consistently outperforms it on complex math reasoning tasks. TraDo-8B-Instruct achieves 4.5% higher accuracy on MATH500 than Qwen2.5-7B-Instruct and 6.6% higher accuracy on LiveCodeBench-V2 than Llama3.1-8B-Instruct. Through curriculum learning, we also develop the first 8B-scale long-CoT diffusion language model.
Supplementary Material: zip
Primary Area: foundation or frontier models, including LLMs
Submission Number: 2125