Keywords: diffusion language models, diffusion models, large language models, inference-time scaling, predictor-corrector sampling, efficient training
TL;DR: We generalize previous predictor-corrector samplers for discrete diffusion models to arbitrary noising processes, and propose a memory-efficient curriculum learning algorithm that is 3x faster than Duo's original curriculum.
Abstract: Uniform-state discrete diffusion models excel at few-step generation and guidance due to their inherent ability to self-correct, making them preferable to autoregressive or masked diffusion models in these settings. Yet their sampling efficiency has been limited by a reliance on standard posterior samplers, whose quality plateaus as the number of steps increases. In this work, we introduce a novel family of Predictor–Corrector (PC) samplers for discrete diffusion models that generalizes prior methods and applies to arbitrary noise processes. When paired with uniform-state diffusion, our samplers significantly outperform ancestral sampling on both language and vision tasks, achieving lower generative perplexity at matched unigram entropy on OpenWebText and better FID/IS scores on CIFAR-10. Crucially, unlike conventional samplers, our PC methods continue to improve generation quality with more sampling steps, narrowing the gap with masked diffusion. Beyond sampling, we develop a fast, memory-efficient curriculum for the Gaussian relaxation phase of Duo$^{++}$ (our method) that avoids materializing large Gaussian-diffused one-hot vectors. This reduces training time by 25\% compared to Duo while maintaining similar validation perplexity on OpenWebText and LM1B, along with strong downstream performance.
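As a rough illustration of the predictor–corrector pattern referenced above (not the paper's exact sampler), the sketch below alternates a posterior (predictor) step with a renoise-then-denoise corrector step at the same noise level. The names `pc_sample`, `denoiser`, `posterior_step`, `forward_noise`, and `corrector_steps` are hypothetical placeholders for whichever components a specific noise process defines.

```python
import torch

@torch.no_grad()
def pc_sample(denoiser, posterior_step, forward_noise, x, timesteps, corrector_steps=1):
    # Hypothetical generic predictor-corrector loop for a discrete diffusion model.
    #   denoiser(x, t)                  -> per-token logits over the clean vocabulary
    #   posterior_step(logits, x, t, s) -> sample x_s from the reverse posterior (predictor)
    #   forward_noise(x, t, s)          -> re-apply forward noise from level s back up to t
    # All three callables are placeholders; only the alternation pattern is illustrated.
    for t, s in zip(timesteps[:-1], timesteps[1:]):  # noise levels, t > s
        # Predictor: one posterior/ancestral step from noise level t down to s.
        x = posterior_step(denoiser(x, t), x, t, s)
        # Corrector: re-noise back up to level t, then denoise to s again,
        # letting the model revise tokens committed in earlier steps.
        for _ in range(corrector_steps):
            x_t = forward_noise(x, t, s)
            x = posterior_step(denoiser(x_t, t), x_t, t, s)
    return x
```

Under these assumptions, the corrector operates at a fixed noise level, so additional corrector iterations can refine sample quality without altering the overall noise schedule.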
Primary Area: generative models
Submission Number: 21321