The Diffusion Duality

Published: 01 May 2025, Last Modified: 18 Jun 2025 · ICML 2025 poster · CC BY 4.0
TL;DR: We prove that discrete diffusion processes admit an underlying Gaussian diffusion formulation which enables the design of faster training and sampling algorithms.
Abstract: Uniform-state discrete diffusion models hold the promise of fast text generation due to their inherent ability to self-correct. However, they are typically outperformed by autoregressive models and masked diffusion models. In this work, we narrow this performance gap by leveraging a key insight: Uniform-state diffusion processes naturally emerge from an underlying Gaussian diffusion. Our method, Duo, transfers powerful techniques from Gaussian diffusion to improve both training and sampling. First, we introduce a curriculum learning strategy guided by the Gaussian process, **doubling training speed** by reducing variance. Models trained with curriculum learning surpass autoregressive models in zero-shot perplexity on 3 of 7 benchmarks. Second, we present Discrete Consistency Distillation, which adapts consistency distillation from the continuous to the discrete setting. This algorithm **unlocks few-step generation in diffusion language models** by accelerating sampling by two orders of magnitude. We provide the code and model checkpoints on the project page: http://s-sahoo.github.io/duo
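To make the stated duality concrete, here is a minimal, illustrative sketch (not the authors' code) of one way a uniform-state discrete process can emerge from a Gaussian one: a one-hot token embedding is corrupted with Gaussian noise, and the argmax of the noisy latent defines a discrete state whose marginal interpolates between the clean token and the uniform distribution over the vocabulary. The argmax mapping, variable names, and noise schedule below are assumptions for illustration only.

```python
import torch

# Illustrative sketch: diffuse a one-hot token with Gaussian noise and take the
# argmax of the latent. The induced discrete marginal moves from the clean token
# (no noise) to uniform over the vocabulary (pure noise), i.e. a uniform-state
# discrete diffusion behaviour. Hypothetical setup, not the paper's exact code.

V = 8                      # vocabulary size (hypothetical)
token = 3                  # clean token id
x = torch.nn.functional.one_hot(torch.tensor(token), V).float()

def discrete_marginal(alpha, n_samples=200_000):
    """Monte-Carlo estimate of p(argmax(alpha * x + sqrt(1 - alpha^2) * eps))."""
    eps = torch.randn(n_samples, V)
    z_t = alpha * x + (1 - alpha**2) ** 0.5 * eps   # Gaussian diffusion latent
    y_t = z_t.argmax(dim=-1)                        # induced discrete state
    return torch.bincount(y_t, minlength=V).float() / n_samples

for alpha in (1.0, 0.8, 0.3, 0.0):
    probs = discrete_marginal(alpha)
    print(f"alpha={alpha}: p(clean token)={probs[token]:.3f}, "
          f"p(another token)~{probs[(token + 1) % V]:.3f}")
# At alpha=1 the argmax always recovers the clean token; at alpha=0 the marginal
# is uniform over all V tokens, matching a uniform-state prior.
```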
Lay Summary: Today’s language models typically generate text one word at a time using an autoregressive (AR) approach, which lacks the ability to revise earlier predictions. In contrast, a newer class—diffusion language models—can predict multiple words simultaneously and revise their predictions, offering both self-correction and the potential for faster generation. However, they often lag behind AR methods in overall performance. In this work, we uncover a surprising connection: these discrete diffusion models are fundamentally linked to a more powerful class of models based on continuous Gaussian diffusion, which has achieved remarkable success in image generation. Building on this insight, we introduce Duo—a new training and sampling framework that transfers advanced techniques from the continuous domain to the discrete setting. This results in both faster training and improved model quality, with **Duo outperforming AR models on several benchmarks**. To enable rapid text generation, we also propose a novel algorithm called Discrete Consistency Distillation. For instance, while traditional AR models require roughly 1,000 steps to generate 1,000 words, Duo can achieve the same result in as few as 10 steps—**a 100× speedup**. Together, these advances bring us closer to real-time, high-performance language models for applications such as smarter and faster chatbots.
Link To Code: https://github.com/s-sahoo/duo
Primary Area: Deep Learning->Generative Models and Autoencoders
Keywords: diffusion language models, diffusion models, large language models, distillation
Submission Number: 13196