Learning Unmasking Policies for Diffusion Language Models

Published: 30 Apr 2026, Last Modified: 24 Jun 2026, ICML 2026 spotlight, CC BY 4.0
TL;DR: We use RL to learn samplers for diffusion LLMs.
Abstract: Diffusion (Large) Language Models (dLLMs) now match the downstream performance of their autoregressive counterparts on many tasks, while holding the promise of being more efficient during inference. One critical design aspect of dLLMs is the *sampling procedure* that selects which tokens to unmask at each diffusion step. Indeed, recent work has found that heuristic strategies such as confidence thresholding improve both sample quality and token throughput compared to random unmasking. However, such heuristics have downsides: they require manual tuning, and we observe that their performance degrades with larger block sizes. In this work, we instead propose to train sampling procedures using reinforcement learning. Specifically, we formalize masked diffusion sampling as a Markov decision process in which the dLLM serves as the environment, and propose a lightweight policy based on a single-layer transformer that maps dLLM token confidences to unmasking decisions. Our experiments show that these trained policies match the performance of state-of-the-art heuristics when combined with semi-autoregressive (block) generation, while outperforming them in the full-diffusion setting. Our code is available at [https://github.com/apple/ml-rl-dllm](https://github.com/apple/ml-rl-dllm).
Lay Summary: Most language models generate text left-to-right. Recently, models have been proposed which can generate text in any order -- forwards, backwards, one word here and another word there, and so on. These models can offer increased speed and flexibility. However, using them requires a strategy for how many and which words to "reveal" in each step. Previous approaches have typically required writing down such strategies by hand. This is challenging in complex settings such as writing mathematical proofs or generating programs. We instead model this as a sort of game, where we get to choose which words to reveal at each step, and at the end receive a reward based on the quality of the final answer and how quickly we got there. We then automatically learn good strategies by repeatedly playing this game. By automating the process, we are able to discover novel strategies that lead to better performance than hand-written rules, especially when generating many words at once.
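To make the sampling loop described in the abstract concrete, the following is a minimal sketch of confidence-threshold unmasking, the kind of heuristic baseline the abstract mentions. It is not the authors' implementation: `toy_model`, `threshold_policy`, `sample`, the threshold `tau`, and the fallback of unmasking the single most confident token when none passes the threshold are all assumptions made for illustration. A learned policy, as proposed in the paper, would take the place of `threshold_policy` and map the same confidences to unmasking decisions.

```python
from typing import Callable, List, Optional, Sequence, Tuple

MASK = None  # hypothetical sentinel for a position that is still masked

Policy = Callable[[Sequence[float], Sequence[bool]], List[int]]


def threshold_policy(confidences: Sequence[float],
                     masked: Sequence[bool],
                     tau: float = 0.9) -> List[int]:
    """Pick masked positions whose confidence is at least tau.

    Assumed fallback: if no masked position passes the threshold,
    unmask the single most confident one so sampling always progresses.
    """
    candidates = [i for i, is_masked in enumerate(masked) if is_masked]
    if not candidates:
        return []
    chosen = [i for i in candidates if confidences[i] >= tau]
    if not chosen:
        chosen = [max(candidates, key=lambda i: confidences[i])]
    return chosen


def toy_model(tokens: Sequence[Optional[str]]) -> Tuple[List[str], List[float]]:
    """Stand-in for a dLLM: returns a prediction and a confidence per position.

    The confidences are fixed here; a real model would recompute them
    from the partially unmasked sequence at every step.
    """
    predictions = [f"w{i}" for i in range(len(tokens))]
    confidences = [0.95, 0.5, 0.92, 0.3][: len(tokens)]
    return predictions, confidences


def sample(model, length: int, policy: Policy) -> Tuple[List[Optional[str]], int]:
    """Run the unmasking loop until no masked positions remain."""
    tokens: List[Optional[str]] = [MASK] * length
    steps = 0
    while any(t is MASK for t in tokens):
        predictions, confidences = model(tokens)
        masked = [t is MASK for t in tokens]
        for i in policy(confidences, masked):
            tokens[i] = predictions[i]
        steps += 1
    return tokens, steps


tokens, steps = sample(toy_model, 4, threshold_policy)
```

With these toy confidences, positions 0 and 2 are unmasked together in the first step and the remaining two positions take one step each, so the step count depends directly on `tau`. That dependence is the manual tuning burden the abstract refers to.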
Link To Code: https://github.com/apple/ml-rl-dllm
Primary Area: Deep Learning
Keywords: Diffusion LLMs, Reinforcement Learning, Efficient Inference, Diffusion Language Models
Submission Number: 7998