Step-Aware Policy Optimization for Reasoning in Diffusion Large Language Models

19 Sept 2025 (modified: 06 Jan 2026) · ICLR 2026 Conference Withdrawn Submission · CC BY 4.0
Keywords: Reasoning, LLM, Diffusion language models
TL;DR: We propose a diffusion-step-aware policy optimization method for improving the reasoning ability of diffusion large language models
Abstract: Diffusion language models (dLLMs) offer a promising non-autoregressive paradigm for text generation, but training them for complex reasoning remains challenging. Current reinforcement learning approaches typically rely on sparse, outcome-based rewards, which can lead to inefficient exploration and "unstructured refinement", where the model's iterative denoising steps fail to contribute meaningfully to the solution. While Process Reward Models (PRMs) effectively mitigate similar issues in autoregressive models, they often require expensive human annotation or external verifiers. In this work, we propose Step-Aware Policy Optimization (SAPO), a method that derives automatic process rewards for dLLMs without external supervision. By leveraging the diffusion model's own step-by-step denoising process, we design a reward function that incentivizes distributing problem complexity evenly across the denoising trajectory. This intrinsic process supervision guides the model to learn structured, robust reasoning paths, reducing the risk of derailing from correct reasoning traces. Our empirical results demonstrate that SAPO significantly improves performance on challenging reasoning benchmarks and enhances the interpretability of the generation process.
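The abstract does not specify the exact form of SAPO's process reward, so the sketch below is only a rough illustration of the stated idea: it assumes a masked-diffusion setting in which per-step "complexity" is proxied by the number of tokens committed at each denoising step, and it rewards trajectories whose per-step progress stays close to uniform. All function names, the uniformity measure, and the weighting hyperparameter are hypothetical, not taken from the paper.

```python
import numpy as np

def step_uniformity_reward(tokens_committed_per_step, eps=1e-8):
    """Illustrative intrinsic reward: encourage the denoising trajectory to spread
    work evenly across steps, using tokens committed per step as a proxy.
    Returns a scalar in (0, 1]; 1.0 means perfectly even progress."""
    p = np.asarray(tokens_committed_per_step, dtype=float)
    p = p / (p.sum() + eps)                         # empirical "progress" distribution over steps
    u = np.full_like(p, 1.0 / len(p))               # uniform reference distribution
    kl = np.sum(p * np.log((p + eps) / (u + eps)))  # KL(progress || uniform)
    return float(np.exp(-kl))                       # KL = 0 -> 1.0; larger KL -> smaller reward

def sapo_style_reward(outcome_correct, tokens_committed_per_step, beta=0.5):
    """Combine the sparse outcome reward with the dense process signal.
    `beta` is an assumed mixing weight, not a value from the paper."""
    r_outcome = 1.0 if outcome_correct else 0.0
    r_process = step_uniformity_reward(tokens_committed_per_step)
    return r_outcome + beta * r_process

# A trajectory that front-loads most of its commitments scores lower than an even one.
print(sapo_style_reward(True, [50, 5, 3, 2]))     # uneven progress across steps
print(sapo_style_reward(True, [15, 15, 15, 15]))  # even progress across steps
```

The key design point this toy version tries to capture is that the reward is computed purely from the model's own denoising trajectory, so no external verifier or human step-level annotation is needed.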
Primary Area: foundation or frontier models, including LLMs
Submission Number: 14870