Any-Order GPT as Masked Diffusion Model: Decoupling Formulation and Architecture

Shuchen Xue; Tianyu Xie; Tianyang Hu; Zijin Feng; Jiacheng Sun; Kenji Kawaguchi; Zhenguo Li; Zhi-Ming Ma

Published: 30 Apr 2026, Last Modified: 24 Jun 2026. Venue: ICML 2026 spotlight. License: CC BY 4.0

Abstract: Efficiently scaling Large Language Models (LLMs) necessitates exploring alternatives to dominant autoregressive (AR) methods, with Masked Diffusion Models (MDMs) emerging as candidates. However, comparing the AR (typically decoder-only) and MDM (often encoder-only) paradigms is confounded by their differing architectures, which obscures the true algorithmic and efficiency trade-offs. This research decouples these factors by evaluating MDMs within a decoder-only framework in order to: (1) equitably compare the MDM (viewed as Any-Order AR) and standard AR paradigms through their differences in token ordering; (2) investigate how MDM architecture affects computational efficiency. We show that decoder-only MDMs, despite a larger modeling space, can achieve significant inference speedups ($\sim25\times$) and comparable perplexity with techniques like temperature annealing, offering a path to reduced inference compute. This work provides insights for developing more computationally efficient foundation models by disentangling core modeling choices from architectural influences. Code is available at https://github.com/scxue/AO-GPT-MDM.

Lay Summary: Most language AI systems write text one word at a time from left to right. A newer family of models can instead fill in missing words in a more flexible order, more like revising a draft than typing a sentence from beginning to end. However, it has been hard to compare these two approaches fairly, because they are usually built with different underlying model designs. In this paper, we separate these two factors. We build a GPT-style model that can predict words in many possible orders, and use it to study what comes from the training approach itself and what comes from the architecture. We find that natural language still benefits strongly from its usual left-to-right structure: training on completely arbitrary word orders can slow learning, while adding a small amount of left-to-right training helps a lot. We also find an important trade-off: some model designs are better at evaluating text, while GPT-style designs can generate text much faster, with comparable quality after careful sampling. These results help clarify how future language models might combine the flexibility of fill-in-the-blank generation with the speed and practicality of today’s GPT-like systems.
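
The contrast between left-to-right and arbitrary prediction orders described above can be illustrated with a toy sketch. This is not the authors' implementation: the function names `sample_order` and `prediction_steps`, and the mixing probability `p_left_to_right=0.1`, are hypothetical choices made only for illustration (the summary does not state what fraction of left-to-right training was used). The sketch only enumerates which positions are visible and which position is predicted at each step; it contains no model.

```python
import random


def sample_order(seq_len, p_left_to_right, rng):
    """Return a prediction order as a permutation of range(seq_len).

    With probability p_left_to_right (a hypothetical knob) the order is the
    usual left-to-right one; otherwise it is a uniformly random permutation.
    """
    order = list(range(seq_len))
    if rng.random() >= p_left_to_right:
        rng.shuffle(order)  # arbitrary (any-order) case
    return order


def prediction_steps(tokens, order):
    """List (visible_positions, target_position, target_token) for each step.

    At step t the model would see the tokens at the first t positions of
    `order` and be asked to predict the token at position order[t].
    """
    steps = []
    for t, pos in enumerate(order):
        steps.append((sorted(order[:t]), pos, tokens[pos]))
    return steps


if __name__ == "__main__":
    rng = random.Random(0)
    tokens = ["the", "cat", "sat", "down"]
    order = sample_order(len(tokens), p_left_to_right=0.1, rng=rng)
    for visible, pos, tok in prediction_steps(tokens, order):
        print(f"given positions {visible}, predict position {pos} ({tok!r})")
```

Setting `p_left_to_right=1.0` recovers the standard autoregressive order, while `0.0` gives purely arbitrary orders; values in between mix the two.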

Link To Code: https://github.com/scxue/AO-GPT-MDM

Primary Area: Deep Learning->Generative Models and Autoencoders

Keywords: Discrete Diffusion Models, Any-Order Autoregressive Models, Causal Architecture

Submission Number: 21712
