Keywords: diffusion language models, diffusion models, large language models, inference-time scaling, predictor-corrector sampling, efficient training
TL;DR: We generalize previous predictor-corrector samplers for discrete diffusion models to arbitrary noising processes and propose a memory-efficient curriculum learning algorithm that is 3x faster than Duo's original curriculum.
Abstract: Uniform-state discrete diffusion models excel at few-step generation and guidance due to their ability to self-correct, making them preferred over autoregressive or Masked diffusion models in these settings. However, their sampling quality plateaus with ancestral samplers as the number of steps increases. We introduce a family of Predictor-Corrector (PC) samplers for discrete diffusion that generalize prior methods and apply to arbitrary noise processes. When paired with uniform-state diffusion, our samplers outperform ancestral sampling on both language and image modeling, achieving lower generative perplexity at matched unigram entropy on OpenWebText and better FID/IS scores on CIFAR10. Crucially, unlike conventional samplers, our PC methods continue to improve with more sampling steps. **Taken together, these findings call into question the assumption that Masked diffusion is the inevitable future of diffusion-based language modeling.** Beyond sampling, we develop a memory-efficient curriculum for the Gaussian relaxation training phase, reducing training time by 25% and memory by 33% compared to Duo while maintaining comparable perplexity on OpenWebText and LM1B and strong downstream performance. We release code, checkpoints, and a video tutorial at [https://s-sahoo.github.io/duo-ch2/](https://s-sahoo.github.io/duo-ch2/).
Primary Area: generative models
Submission Number: 21321