Unifying Autoregressive And Diffusion-Based Sequence Generation

Published: 06 Mar 2025, Last Modified: 14 Apr 2025. ICLR 2025 DeLTa Workshop Poster. License: CC BY 4.0
Track: long paper (up to 8 pages)
Keywords: Discrete Diffusion, Language Models, Autoregressive Language Models, KV Caching, Fast Inference
TL;DR: We build upon diffusion language models to (1) make them autoregressive and (2) use a hybrid of the "uniform" and "absorb" token noising processes.
Abstract: We present significant extensions to diffusion-based sequence generation models, blurring the line between them and autoregressive language models. First, we introduce *hyperschedules*, which assign distinct noise schedules to individual token positions, generalizing both autoregressive models (*e.g.*, GPT) and conventional diffusion models (*e.g.*, SEDD, MDLM) as special cases. Second, we propose two *hybrid token-wise noising processes* that interpolate between absorbing and uniform processes, enabling the model to fix past mistakes, and we introduce a *novel inference algorithm* that leverages this new feature in a simplified context inspired by MDLM. To support efficient training and inference, we design attention masks compatible with KV-caching. Our methods achieve state-of-the-art perplexity and generate diverse, high-quality sequences across standard benchmarks, suggesting a promising path for autoregressive diffusion-based sequence generation.
Submission Number: 111
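The abstract's two core ideas, per-position *hyperschedules* and a hybrid absorbing/uniform noising process, can be illustrated with a short sketch. The snippet below is a minimal approximation based only on the abstract, not the authors' implementation; the function names, the `MASK_ID` convention, and the `eps` mixing weight are illustrative assumptions.

```python
# A toy sketch (not the paper's code) of two ideas from the abstract:
# (1) a "hyperschedule" assigning a distinct noise level to each token position
#     at each step, with autoregressive and standard diffusion as special cases;
# (2) a hybrid absorbing/uniform corruption of a token sequence.
import numpy as np

VOCAB_SIZE = 8              # toy vocabulary; last id reserved as the absorbing [MASK] token
MASK_ID = VOCAB_SIZE - 1


def hyperschedule(step: int, num_steps: int, seq_len: int, mode: str) -> np.ndarray:
    """Return per-position noise levels alpha[i] in [0, 1] at a given step.

    mode="diffusion": every position shares one schedule (SEDD/MDLM-style).
    mode="autoregressive": position i stays fully noised until the step that
        "reveals" it, recovering left-to-right generation as a special case.
    """
    t = step / num_steps
    if mode == "diffusion":
        return np.full(seq_len, t)
    elif mode == "autoregressive":
        reveal_step = np.arange(seq_len) * num_steps / seq_len
        return (step < reveal_step).astype(float)
    raise ValueError(mode)


def hybrid_corrupt(tokens: np.ndarray, alpha: np.ndarray, eps: float,
                   rng: np.random.Generator) -> np.ndarray:
    """Corrupt each position independently with probability alpha[i].

    A corrupted token is absorbed into MASK_ID with probability (1 - eps),
    or resampled uniformly from the vocabulary with probability eps.
    eps=0 recovers a pure absorbing process, eps=1 a pure uniform process.
    """
    corrupted = tokens.copy()
    hit = rng.random(tokens.shape) < alpha
    use_uniform = rng.random(tokens.shape) < eps
    corrupted[hit & ~use_uniform] = MASK_ID
    uniform_draws = rng.integers(0, VOCAB_SIZE - 1, size=tokens.shape)
    corrupted[hit & use_uniform] = uniform_draws[hit & use_uniform]
    return corrupted


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = rng.integers(0, VOCAB_SIZE - 1, size=10)
    alpha = hyperschedule(step=3, num_steps=10, seq_len=10, mode="diffusion")
    print("clean:    ", x)
    print("corrupted:", hybrid_corrupt(x, alpha, eps=0.2, rng=rng))
```

With `mode="autoregressive"`, each position's schedule becomes a step function that reveals tokens left to right, while `eps` interpolates between the absorbing (`eps=0`) and uniform (`eps=1`) corruption processes described in the abstract.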