TL;DR: We introduce Self-Aware Scheduling (SAS), a plug-and-play framework for optimizing the unmasking order of masked diffusion language models.
Abstract: Masked diffusion language models decode by iteratively unmasking tokens, where the unmasking order defines an “order of thought”
that strongly influences generation quality yet is typically chosen heuristically.
We derive a tractable upper bound on the sequential decoding mismatch, measured by the Kullback–Leibler divergence and expressed in terms of the model’s pathwise log-likelihood; the bound is tight when the model is sufficiently expressive.
This bound induces a dense self-aware reward over ordered paths for a target sequence $x$ and unmasking order $\sigma$,
casting order selection as a principled policy optimization problem with a frozen denoiser.
We instantiate this idea as **Self-Aware Scheduling (SAS)**, which learns a lightweight order policy using Group Relative Policy Optimization and applies seamlessly to both sequential and semi-autoregressive decoding.
On Sudoku with a 1B MDM, SAS improves puzzle accuracy from $82.0\%$ (best heuristic schedule) to $91.8\%$, and reaches $97.9\%$ with second-stage fine-tuning along learned trajectories.
On LLaDA-8B, SAS improves pass@1 on GSM8K from $64\%$ to $76\%$ (full diffusion) and on MBPP from $39.5\%$ to $41\%$, while consistently matching or exceeding heuristic schedules across generation lengths and block sizes.
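As a rough illustration of the two ingredients named in the abstract, the sketch below scores an unmasking order by the pathwise log-likelihood of a target sequence under a frozen denoiser, then turns a group of such scores into group-relative advantages in the style of Group Relative Policy Optimization. It is not the SAS implementation: `MASK`, `toy_denoiser`, `pathwise_log_likelihood`, and `group_relative_advantages` are hypothetical names, the denoiser is a hand-written toy over a two-token vocabulary, and the exact form of the reward used in the paper is not reproduced here.

```python
import math
from typing import Callable, Dict, List, Sequence

MASK = -1  # hypothetical mask token id (assumption, not from the paper)

# A denoiser maps (partially masked state, position) to a distribution over tokens.
Denoiser = Callable[[List[int], int], Dict[int, float]]


def toy_denoiser(state: List[int], pos: int) -> Dict[int, float]:
    """Toy stand-in for a frozen denoiser over the vocabulary {0, 1}.

    If the left neighbour of `pos` is already revealed, predict the same
    token with probability 0.9; otherwise fall back to a uniform guess.
    """
    if pos > 0 and state[pos - 1] != MASK:
        same = state[pos - 1]
        return {same: 0.9, 1 - same: 0.1}
    return {0: 0.5, 1: 0.5}


def pathwise_log_likelihood(
    x: Sequence[int], order: Sequence[int], denoiser: Denoiser
) -> float:
    """Sum of log p(x[pos] | tokens revealed so far) along the given order."""
    state = [MASK] * len(x)
    total = 0.0
    for pos in order:
        probs = denoiser(state, pos)
        total += math.log(probs[x[pos]])
        state[pos] = x[pos]  # reveal the target token before the next step
    return total


def group_relative_advantages(
    rewards: Sequence[float], eps: float = 1e-8
) -> List[float]:
    """Normalize rewards within a group: (r - mean) / (std + eps)."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = math.sqrt(var)
    return [(r - mean) / (std + eps) for r in rewards]


if __name__ == "__main__":
    target = [1, 1, 1]
    orders = [[0, 1, 2], [2, 1, 0]]  # left-to-right vs right-to-left
    scores = [pathwise_log_likelihood(target, o, toy_denoiser) for o in orders]
    for order, score, adv in zip(orders, scores, group_relative_advantages(scores)):
        print(order, round(score, 4), round(adv, 4))
```

With this toy denoiser, the left-to-right order scores higher because each later position can condition on a revealed left neighbour, so it receives the positive advantage; a learned order policy would be updated to favour such orders while the denoiser stays fixed.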
Lay Summary: Most large language models write text from left to right, one token at a time. Diffusion language models offer a different way: they start with a mostly blank sentence and gradually fill in the missing pieces. But this raises an important question: in what order should the model fill the blanks? Today, this order is usually chosen by simple rules, such as filling the tokens the model is most confident about first, but these rules can be short-sighted. In this work, we show that the order itself can be learned. We introduce Self-Aware Scheduling, a lightweight method that teaches a diffusion language model when to reveal each piece of an answer, using the model’s own likelihood as feedback while keeping the main model fixed. Across Sudoku, math, and coding tasks, learning this “order of thought” improves reasoning performance over common hand-designed schedules. Our results suggest that future diffusion language models can get better not only by learning what to generate, but also by learning when to commit to each part of the answer.
Originally Submitted Supplementary Material: zip
Link To Code: https://jimmyxu123.github.io/SAS/
Primary Area: Deep Learning->Large Language Models
Keywords: masked diffusion, reinforcement learning, reasoning, LLMs
Originally Submitted PDF: pdf
Submission Number: 31690