SPG: Sandwiched Policy Gradient for Masked Diffusion Language Models

ICLR 2026 Conference Submission 21332 Authors

Published: 19 Sept 2025 (modified: 08 Oct 2025) · ICLR 2026 Conference Submission · License: CC BY 4.0
Keywords: Reinforcement Learning, Diffusion Language Models, Policy Gradient
TL;DR: We propose SPG, an RL algorithm for dLLMs that addresses the challenge of log-likelihood estimation by leveraging both an upper and a lower bound on the true log-likelihood. Extensive experiments showcase the effectiveness of SPG.
Abstract: Diffusion large language models (dLLMs) are emerging as an efficient alternative to autoregressive models due to their ability to decode multiple tokens in parallel. However, aligning dLLMs with human preferences or task-specific rewards via reinforcement learning (RL) is challenging because their intractable log-likelihood precludes the direct application of standard policy gradient methods. While prior work uses surrogates such as the evidence lower bound (ELBO), these one-sided approximations can introduce significant policy gradient bias. To address this, we propose Sandwiched Policy Gradient (SPG), which leverages both an upper and a lower bound on the true log-likelihood. Experiments show that SPG significantly outperforms baselines based on ELBO or one-step estimation. Specifically, SPG improves accuracy over state-of-the-art RL methods for dLLMs by 3.6% on GSM8K, 2.6% on MATH500, 18.4% on Countdown, and 27.0% on Sudoku.
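To make the "sandwiched" idea concrete, here is a minimal PyTorch sketch of one way such a surrogate could be formed. It is not the paper's algorithm: the function name `sandwiched_pg_loss`, the inputs `elbo`/`eubo` (Monte-Carlo lower/upper bound estimates of log-likelihood), and the assumption that the lower bound is used for positive-advantage samples and the upper bound for negative-advantage samples are all illustrative choices, not details confirmed by the abstract.

```python
import torch

def sandwiched_pg_loss(elbo, eubo, advantages):
    """Hypothetical sandwiched policy-gradient surrogate (illustrative only).

    elbo:       (B,) lower-bound estimates of log p_theta(y | x) per sample
    eubo:       (B,) upper-bound estimates of log p_theta(y | x) per sample
    advantages: (B,) reward advantages (e.g., group-normalized rewards)

    Assumption: use the lower bound where the advantage is positive and the
    upper bound where it is negative, so the advantage-weighted surrogate
    never overestimates the true objective.
    """
    log_prob_surrogate = torch.where(advantages >= 0, elbo, eubo)
    # REINFORCE-style loss: maximize advantage-weighted log-likelihood.
    return -(advantages.detach() * log_prob_surrogate).mean()

# Toy usage with made-up bound estimates for a batch of 4 completions.
elbo = torch.tensor([-12.3, -8.1, -15.0, -9.4], requires_grad=True)
eubo = torch.tensor([-10.9, -7.5, -13.2, -8.8], requires_grad=True)
adv = torch.tensor([0.7, -0.3, 1.1, -0.9])
loss = sandwiched_pg_loss(elbo, eubo, adv)
loss.backward()
```

In a real dLLM setting the bound estimates would themselves be differentiable functions of the model parameters (e.g., ELBO terms computed over sampled masking steps); the sketch only shows how two-sided bounds could be combined into a single policy-gradient loss.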
Primary Area: foundation or frontier models, including LLMs
Submission Number: 21332