Scaling Behavior of Discrete Diffusion Language Models

ICLR 2026 Conference Submission 21664 Authors

19 Sept 2025 (modified: 08 Oct 2025) · ICLR 2026 Conference Submission · CC BY 4.0
Keywords: diffusion, discrete diffusion, diffusion language models, scaling, scaling laws, optimal batch size, critical batch size
TL;DR: We find that uniform diffusion language models scale more favorably than both masked diffusion and autoregressive models in both compute- and data-bound settings.
Abstract: Modern LLM pre-training consumes vast amounts of both compute and training data, making the scaling behavior, or scaling laws, of different models a key distinguishing factor. Discrete diffusion language models (DLMs) have been proposed as an alternative to autoregressive language models (ALMs), but their scaling behavior has not yet been fully explored, with prior work suggesting that they require more data and compute to match the performance of ALMs. We study the scaling behavior of DLMs under different noise types by smoothly interpolating between masked and uniform diffusion while paying close attention to crucial hyperparameters such as batch size and learning rate. Our experiments show that the scaling behavior of DLMs depends strongly on the noise type and differs considerably from that of ALMs. Surprisingly, we find that uniform diffusion requires more parameters and less data for compute-efficient training than masked diffusion. Moreover, uniform diffusion models scale more favorably in both compute and data than their masked counterparts, making them a promising option in both compute- and data-bound training environments. In the process of deriving the scaling laws, we reformulate the discrete diffusion ELBO in terms of the signal-to-noise ratio, closing the gap to continuous diffusion theory and simplifying both theory and implementation. We also find that the optimal batch size of DLMs shows no signs of saturation, in contrast to ALMs, which typically show diminishing returns from scaling batches beyond $10^6$ tokens. Training code and models will be open-sourced upon acceptance.
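
To make the phrase "smoothly interpolating between masked and uniform diffusion" concrete, the sketch below shows one common way such an interpolation can be parameterized at the level of the forward-noising marginal: with probability proportional to a signal level, the original token is kept; otherwise it is corrupted either uniformly over the vocabulary or absorbed into a mask token. The function name `noisy_marginal`, the mixing weight `lam`, and this exact parameterization are illustrative assumptions, not the authors' formulation.

```python
# Minimal sketch (assumed parameterization, not the paper's code): a forward
# marginal q(x_t | x_0) that interpolates between uniform and masked (absorbing)
# discrete diffusion. lam = 1.0 recovers purely uniform noise, lam = 0.0 purely
# masked noise; alpha_t is the per-step signal level from the noise schedule.
import numpy as np

def noisy_marginal(x0: int, alpha_t: float, lam: float,
                   vocab_size: int, mask_id: int) -> np.ndarray:
    """Return q(x_t | x_0) over vocab_size + 1 categories (last index = mask)."""
    probs = np.zeros(vocab_size + 1)
    probs[x0] += alpha_t                                   # keep original token
    probs[:vocab_size] += (1.0 - alpha_t) * lam / vocab_size   # uniform corruption
    probs[mask_id] += (1.0 - alpha_t) * (1.0 - lam)            # absorbing corruption
    return probs

# Example: halfway through the schedule, equal mix of uniform and masked noise.
q = noisy_marginal(x0=3, alpha_t=0.5, lam=0.5, vocab_size=8, mask_id=8)
assert np.isclose(q.sum(), 1.0)
```

Under this kind of parameterization, sweeping `lam` from 0 to 1 traces a family of noise types whose scaling behavior can be compared, which is the experimental axis the abstract describes.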
Primary Area: foundation or frontier models, including LLMs
Submission Number: 21664