Latent Adaptation with Masked Policy for Diffusion Language Models

ICLR 2026 Conference Submission14739 Authors

19 Sept 2025 (modified: 08 Oct 2025) · ICLR 2026 Conference Submission · CC BY 4.0
Keywords: diffusion language model, test-time scaling
TL;DR: Test Time Policy Gradient for Diffusion Language Models
Abstract: Diffusion large language models (dLLMs) have emerged as promising alternatives to autoregressive generation, yet their ability to refine reasoning at test time remains underexplored. We present $\textbf{LAMP}$ (Latent Adaptation with Masked Policy), a training-free framework for reward-guided \emph{latent policy optimization} in masked diffusion models. LAMP treats hidden token states as optimizable latents and adapts them per instance via policy-gradient updates, enabling direct reward feedback to shape the reasoning process without altering model parameters. To accommodate diffusion’s non-sequential decoding, we adopt a masked-policy strategy that selectively reopens and edits uncertain positions while preserving global consistency through re-inpainting. This design allows targeted latent edits to propagate coherently across the diffusion trajectory. Experiments on GSM8K, MATH-500, and AIME show consistent improvements over strong dLLM baselines. Our results establish reward-guided latent adaptation as a practical and effective axis for enhancing reasoning in diffusion-based language models.
Supplementary Material: zip
Primary Area: generative models
Submission Number: 14739
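
Below is a minimal, hedged sketch of the general idea described in the abstract: reward-guided, test-time policy-gradient adaptation of per-instance latents at low-confidence positions of a masked diffusion LM. It is not the authors' implementation; the model stand-in (`toy_dlm_logits`), the reward (`reward_fn`), and all hyperparameters are illustrative assumptions.

```python
# Illustrative sketch only: a toy stand-in model and reward, not LAMP itself.
import torch
import torch.nn.functional as F

VOCAB, SEQ_LEN, MASK_ID = 100, 16, 0
torch.manual_seed(0)

# Fixed random projection acting as a stand-in for a masked diffusion LM.
W = torch.randn(VOCAB, VOCAB) * 0.1

def toy_dlm_logits(tokens: torch.Tensor) -> torch.Tensor:
    """Return per-position vocabulary logits for a token sequence (toy model)."""
    return F.one_hot(tokens, VOCAB).float() @ W

def reward_fn(tokens: torch.Tensor) -> torch.Tensor:
    """Hypothetical scalar reward (e.g., a verifier score); here a toy target count."""
    return (tokens == 7).float().mean()

# Start from a decoded draft and its per-position model confidences.
draft = torch.randint(1, VOCAB, (SEQ_LEN,))
with torch.no_grad():
    conf = toy_dlm_logits(draft).softmax(-1).max(-1).values

# Masked policy: reopen the k least-confident positions for editing.
k = 4
reopen = conf.topk(k, largest=False).indices
masked = draft.clone()
masked[reopen] = MASK_ID

# Per-instance latent logits at the reopened positions, optimized by policy gradient.
latent = torch.zeros(k, VOCAB, requires_grad=True)
opt = torch.optim.Adam([latent], lr=0.1)

for step in range(50):
    # Combine the model's proposal for the masked slots with the adaptable latent
    # (a crude analogue of re-inpainting the reopened positions).
    with torch.no_grad():
        base = toy_dlm_logits(masked)[reopen]
    dist = torch.distributions.Categorical(logits=base + latent)
    sample = dist.sample()
    candidate = draft.clone()
    candidate[reopen] = sample
    r = reward_fn(candidate)
    # REINFORCE-style update: increase the probability of samples with higher reward.
    # A real implementation would also subtract a baseline to reduce variance.
    loss = -(r.detach() * dist.log_prob(sample).sum())
    opt.zero_grad()
    loss.backward()
    opt.step()

print("final reward:", reward_fn(candidate).item())
```

Because only the latent logits at reopened positions are optimized and the model weights are never touched, this kind of adaptation is training-free in the sense the abstract describes: all reward feedback is applied per instance at test time.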