Learning Unmasking Policies for Diffusion Language Models

Published: 03 Mar 2026, Last Modified: 10 Mar 2026
Venue: ICLR 2026 DeLTa Workshop (Oral)
License: CC BY 4.0
Keywords: Diffusion LLMs, Reinforcement Learning, Efficient Inference
TL;DR: We use RL to learn samplers for diffusion LLMs.
Abstract: Diffusion (Large) Language Models (dLLMs) now match the downstream performance of their autoregressive counterparts on many tasks, while holding the promise of being more efficient during inference. One critical design aspect of dLLMs is the *sampling procedure* that selects which tokens to unmask at each diffusion step. Indeed, recent work has found that heuristic strategies such as confidence thresholding improve both sample quality and token throughput compared to random unmasking. However, such heuristics have downsides: they require manual tuning, and we observe that their performance degrades with larger block sizes. In this work, we instead propose to train sampling procedures using reinforcement learning. Specifically, we formalize masked diffusion sampling as a Markov decision process in which the dLLM serves as the environment, and propose a lightweight policy based on a single-layer transformer that maps dLLM token confidences to unmasking decisions. Our experiments show that these trained policies match the performance of state-of-the-art heuristics when combined with semi-autoregressive (block) generation, while outperforming them in the full-diffusion setting.
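To make the abstract's setup concrete, below is a minimal sketch of the kind of policy it describes: a single-layer transformer that reads per-token dLLM confidences and scores which masked positions to unmask at a diffusion step. The architecture details (embedding width, attention form, selection rule) and all names are assumptions for illustration, not the paper's implementation, and the weights here are random rather than RL-trained.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

class UnmaskingPolicy:
    """Hypothetical single-layer transformer: token confidences -> unmask logits."""
    def __init__(self, d=8, seed=0):
        rng = np.random.default_rng(seed)
        self.W_in = rng.normal(scale=0.1, size=(1, d))  # embed scalar confidence
        self.W_q = rng.normal(scale=0.1, size=(d, d))
        self.W_k = rng.normal(scale=0.1, size=(d, d))
        self.W_v = rng.normal(scale=0.1, size=(d, d))
        self.w_out = rng.normal(scale=0.1, size=(d,))   # per-token unmask score

    def __call__(self, conf, still_masked):
        # conf: (L,) dLLM confidence per position; still_masked: (L,) bool
        h = conf[:, None] @ self.W_in                   # (L, d) embeddings
        q, k, v = h @ self.W_q, h @ self.W_k, h @ self.W_v
        att = softmax(q @ k.T / np.sqrt(k.shape[-1]), axis=-1)
        h = h + att @ v                                 # one self-attention layer
        logits = h @ self.w_out                         # (L,) unmask logits
        logits[~still_masked] = -np.inf                 # only masked tokens eligible
        return logits

# One diffusion step: the policy action is which masked position to unmask.
conf = np.array([0.9, 0.2, 0.6, 0.4])
still_masked = np.array([True, True, False, True])  # position 2 already revealed
policy = UnmaskingPolicy()
logits = policy(conf, still_masked)
choice = int(np.argmax(logits))  # greedy action; RL training would shape these logits
```

In the MDP framing from the abstract, `conf` comes from querying the frozen dLLM (the environment), the unmask choice is the action, and downstream sample quality provides the reward; a trained policy can thus subsume hand-tuned confidence thresholds.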
Submission Number: 39