Auto-Regressive Masked Diffusion Models

Published: 03 Feb 2026 · Last Modified: 03 Feb 2026 · AISTATS 2026 Poster · CC BY 4.0
TL;DR: We present the Auto-Regressive Masked Diffusion (ARMD) model, an architecture designed to bridge the performance gap between masked diffusion and autoregressive models by unifying the training efficiency of autoregressive models with the strengths of diffusion-based learning.
Abstract: Masked diffusion models (MDMs) have emerged as a promising approach for language modeling, yet they face a performance gap compared to autoregressive models (ARMs) and require more training iterations. In this work, we present the Auto-Regressive Masked Diffusion (ARMD) model, an architecture designed to bridge this gap by unifying the training efficiency of autoregressive models with the strengths of diffusion-based learning. Our key insight is to interpret the masked diffusion process as a block-wise causal model. This allows us to design a strictly causal, permutation-equivariant, attention-based architecture that computes all conditional probabilities across multiple denoising steps in a single, parallel forward pass. The resulting architecture supports efficient, autoregressive-style decoding and a progressive permutation training scheme, allowing the model to learn both canonical left-to-right and random token orderings. On standard language modeling benchmarks, ARMD achieves state-of-the-art performance, outperforming established diffusion-based methods while requiring significantly fewer training steps.
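To make the block-wise causal interpretation concrete, the sketch below builds an attention mask in which positions are revealed in a permuted order, a fixed number at a time, and each position may attend only to positions revealed in strictly earlier blocks. This is a minimal illustration, not the paper's actual architecture; the function name `blockwise_causal_mask`, the `block_size` parameter, and the strict earlier-block rule are assumptions introduced here for exposition.

```python
import torch

def blockwise_causal_mask(perm: torch.Tensor, block_size: int) -> torch.Tensor:
    """Boolean mask: entry [i, j] is True if position i may attend to position j.

    Illustrative assumption: positions are revealed ("denoised") in the order
    given by `perm`, in groups of `block_size`; a position may attend only to
    positions revealed in strictly earlier blocks, giving block-wise causality
    under the chosen permutation.
    """
    L = perm.numel()
    step_block = torch.arange(L) // block_size   # block index of each denoising step
    block_of = torch.empty(L, dtype=torch.long)
    block_of[perm] = step_block                  # block in which each position is revealed
    # [i, j] is True iff position j's block comes strictly before position i's block.
    return block_of.unsqueeze(1) > block_of.unsqueeze(0)

# Example: 8 positions denoised 2 at a time in a random order.
perm = torch.randperm(8)
mask = blockwise_causal_mask(perm, block_size=2)
print(mask.int())
```

With such a mask, a single parallel forward pass can score every position conditioned on the blocks revealed before it, which is the property the abstract attributes to ARMD's training; setting `perm` to the identity recovers ordinary left-to-right block causality.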
Submission Number: 1571