Latent Refinement Decoding: Enhancing Diffusion-Based Language Models by Refining Belief States

19 Sept 2025 (modified: 19 Feb 2026) · ICLR 2026 Conference Withdrawn Submission · CC BY 4.0
Keywords: Diffusion Language Models, Latent Refinement Decoding, Mixture Embedding
TL;DR: We propose Latent Refinement Decoding, a two-stage diffusion decoding framework that reduces information loss and improves accuracy with faster inference.
Abstract: Autoregressive (AR) models remain the standard for natural language generation but still suffer from high latency due to strictly sequential decoding. Recent diffusion-inspired approaches, such as LLaDA and Dream, mitigate this by generating in parallel, yet they face two core limitations: information loss, as predictive distributions for non-finalised tokens are discarded at each step, and a lack of well-behaved commitment dynamics, where local decisions are not properly coordinated at the global level. We introduce Latent Refinement Decoding (LRD), a two-stage framework with Latent Refinement and a Predictive Feedback Loop. The first stage maintains masked positions as distributional mixtures of predicted tokens and the mask embedding, allowing the model to establish more globally consistent beliefs. The second stage progressively finalises confident tokens while retaining uncertain ones for iterative feedback. KL-divergence dynamics provide a principled and reliable criterion for convergence and early stopping. Experiments across coding (HumanEval +6.3, MBPP +2.6) and reasoning (GSM8K +2.9, MATH500 +3.8) benchmarks show that LRD improves accuracy while delivering speedups of up to 10.6×. Moreover, LRD is orthogonal to system-level optimisation: when combined with KV-cache and parallel-decoding accelerators (e.g., Fast-dLLM), it improves accuracy and yields up to 2.4× additional speedup, making it a strong and versatile alternative for parallel sequence generation.
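The abstract's two ingredients can be sketched in toy form: masked positions held as belief distributions embedded via a convex mixture with the mask embedding, positions finalised once their beliefs become confident, and iteration stopped when the KL divergence between successive beliefs falls below a tolerance. This is a minimal illustrative sketch, not the paper's implementation; the refinement step, thresholds, and mixture weight `alpha` are all stand-in assumptions, with a fixed target distribution playing the role of a real denoiser's forward pass.

```python
import numpy as np

rng = np.random.default_rng(0)

V, D, L = 8, 4, 5            # toy vocab size, embedding dim, sequence length
E = rng.normal(size=(V, D))  # random stand-in for the token embedding table
mask_emb = np.zeros(D)       # stand-in embedding for the [MASK] token

def mixture_embedding(p, alpha=0.5):
    """Embed a masked position as a convex mixture of the expected token
    embedding under belief p and the mask embedding (assumed form)."""
    return alpha * (p @ E) + (1.0 - alpha) * mask_emb

def kl(p, q, eps=1e-12):
    """KL(q || p)-style divergence between successive belief vectors."""
    return float(np.sum(q * (np.log(q + eps) - np.log(p + eps))))

def refine(p, target):
    """Toy refinement step: nudge beliefs toward a fixed target
    distribution, standing in for a real model forward pass."""
    new = 0.7 * p + 0.3 * target
    return new / new.sum()

# Uniform initial beliefs over the vocabulary at every masked position.
beliefs = np.full((L, V), 1.0 / V)
targets = rng.dirichlet(np.full(V, 0.1), size=L)  # spiky stand-in preferences

finalised = {}
for step in range(50):
    # Mixture embeddings would feed the model's next forward pass.
    x = np.stack([mixture_embedding(p) for p in beliefs])
    new_beliefs = np.stack([refine(p, t) for p, t in zip(beliefs, targets)])
    # KL between successive beliefs as the convergence / early-stop signal.
    delta = max(kl(p, q) for p, q in zip(beliefs, new_beliefs))
    beliefs = new_beliefs
    # Progressively finalise positions whose beliefs are confident.
    for i in range(L):
        if i not in finalised and beliefs[i].max() > 0.9:
            finalised[i] = int(beliefs[i].argmax())
    if delta < 1e-4:
        break
```

Under these assumptions the beliefs converge geometrically toward the targets, the KL signal decays with them, and confident positions are committed early while uncertain ones keep circulating through the feedback loop.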
Supplementary Material: zip
Primary Area: generative models
Submission Number: 19228