TL;DR: We propose an efficient method for training masked diffusion models that speeds up training by up to 2.3x.
Abstract: Masked Diffusion Models (MDMs) have emerged as a promising approach for generative modeling in discrete spaces. By generating sequences in any order and allowing for parallel decoding, they enable fast inference and strong performance on non-causal tasks. However, this flexibility comes with a *training complexity* trade-off: MDMs train on an exponentially large set of masking patterns, which is not only computationally expensive, but also creates a train--test mismatch between the random masks used in training and the highly structured masks induced by inference-time unmasking.
In this work, we propose Progressive UnMAsking (PUMA), a simple modification of the forward masking process that aligns training-time and inference-time masking patterns, thereby focusing optimization on *inference-aligned masks* and speeding up training. Empirically, PUMA speeds up pretraining at the 125M scale by $\approx 2.3 \times$ and offers complementary advantages on top of common recipes like autoregressive initialization. We open-source our codebase at https://github.com/JaeyeonKim01/PUMA.
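The train--test mismatch described above can be illustrated with a toy sketch. This is not the PUMA implementation and does not reproduce the released codebase. It only contrasts the standard random forward masking used in MDM training with the kind of structured masks that a confidence-based unmasking loop visits at inference time. All names here (`MASK`, `random_mask`, `unmasking_trajectory`, `left_to_right_confidence`) are hypothetical, and the confidence function is a stand-in for a real model's per-position confidence.

```python
import numpy as np

MASK = -1  # hypothetical mask-token id, chosen only for this sketch


def random_mask(tokens, rng):
    """Training-style mask: draw a rate t ~ U(0, 1), then mask each
    position independently with probability t."""
    t = rng.uniform()
    masked = rng.uniform(size=tokens.shape) < t
    return np.where(masked, MASK, tokens), masked


def unmasking_trajectory(confidence_fn, length, per_step):
    """Inference-style masks: start fully masked and reveal the `per_step`
    most confident masked positions at every step."""
    masked = np.ones(length, dtype=bool)
    trajectory = [masked.copy()]
    while masked.any():
        # Already revealed positions get -inf so they are never ranked first.
        scores = np.where(masked, confidence_fn(masked), -np.inf)
        reveal = np.argsort(-scores)[:per_step]
        reveal = reveal[masked[reveal]]  # drop revealed positions if few remain
        masked[reveal] = False
        trajectory.append(masked.copy())
    return trajectory


def left_to_right_confidence(masked):
    # Toy stand-in for model confidence: earlier positions score higher.
    return -np.arange(len(masked), dtype=float)


rng = np.random.default_rng(0)
tokens = np.arange(8)
noisy, train_mask = random_mask(tokens, rng)
infer_masks = unmasking_trajectory(left_to_right_confidence, length=8, per_step=2)
```

With this toy confidence function, every mask in `infer_masks` is a contiguous suffix of the sequence, while `train_mask` is scattered across positions. Under these assumptions, random masking spends most of its samples on patterns that the unmasking loop never visits, which is the gap the abstract refers to.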
Lay Summary: Many modern AI systems create text by filling in missing pieces rather than writing strictly from left to right. This can make them faster and more flexible when generating answers, but it also makes them harder to train efficiently: during training, they practice many random patterns of missing words, while during actual use, they tend to fill in words in a much more organized order.
This paper introduces Progressive UnMAsking, or PUMA, a simple change to how these models are trained. Instead of showing the model random missing-word patterns, PUMA trains it on patterns that better match the ones it will encounter when generating text. This focuses training on the situations that matter most in practice.
In experiments, PUMA trains models substantially faster, reaching the same performance with about 2.3 times fewer training steps. It also works well alongside common training improvements, such as starting from an already trained language model. We also show that PUMA works for larger models with up to 7 billion parameters.
Originally Submitted Supplementary Material: zip
Link To Code: https://github.com/JaeyeonKim01/PUMA
Primary Area: Deep Learning->Generative Models and Autoencoders
Keywords: Discrete Diffusion, Masked Diffusion Models, Diffusion Models, Distribution Design, Learning Theory, Language Models
Originally Submitted PDF: pdf
Submission Number: 7673