Abstract: Autoregressive and diffusion models have achieved remarkable progress in language modeling and visual generation, respectively. We present ACDiT, a novel Autoregressive blockwise Conditional Diffusion Transformer that combines the autoregressive and diffusion paradigms for continuous visual information. By introducing a block-wise autoregressive unit, ACDiT offers a flexible interpolation between token-wise autoregression and full-sequence diffusion, bypassing the limitations of discrete tokenization. The generation of each block is formulated as a conditional diffusion process, conditioned on prior blocks. ACDiT is easy to implement: it amounts to applying a specially designed Skip-Causal Attention Mask (SCAM) to a standard diffusion transformer during training. During inference, the process alternates between diffusion denoising and autoregressive decoding and can make full use of the KV-Cache. We validate the effectiveness of ACDiT on image, video, and text generation, and show that ACDiT performs best among autoregressive baselines of similar model scale on visual generation tasks. We also demonstrate that, benefiting from autoregressive modeling, pretrained ACDiT transfers to visual understanding tasks despite being trained with a generative objective. Our analysis of the trade-off between autoregression and diffusion demonstrates the potential of ACDiT for long-horizon visual generation tasks. We hope that ACDiT offers a novel perspective on visual autoregressive generation and sheds light on new avenues for unified models.
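To make the blockwise conditioning described in the abstract concrete, below is a minimal sketch of how a Skip-Causal Attention Mask could be constructed. The sequence layout (clean blocks followed by noised blocks), the function name, and the block structure are assumptions for illustration only, not the paper's exact implementation.

```python
import torch

def skip_causal_attention_mask(num_blocks: int, block_len: int) -> torch.Tensor:
    """Sketch of a blockwise skip-causal attention mask (hypothetical layout).

    Assumed training sequence layout: [clean_1, ..., clean_B, noisy_1, ..., noisy_B],
    each block containing `block_len` tokens.
      * clean block i attends causally to clean blocks 1..i;
      * noisy block i attends to clean blocks 1..i-1 (skipping its own clean
        block) and to itself, so denoising is conditioned only on prior blocks.
    Returns a boolean mask of shape (2*B*L, 2*B*L); True = attention allowed.
    """
    B, L = num_blocks, block_len
    n = 2 * B * L
    mask = torch.zeros(n, n, dtype=torch.bool)
    for i in range(B):
        clean_i = slice(i * L, (i + 1) * L)
        noisy_i = slice((B + i) * L, (B + i + 1) * L)
        # clean block i -> clean blocks 1..i (blockwise causal)
        mask[clean_i, : (i + 1) * L] = True
        # noisy block i -> clean blocks 1..i-1 (skip its own clean block)
        mask[noisy_i, : i * L] = True
        # noisy block i -> itself (full attention within the noised block)
        mask[noisy_i, noisy_i] = True
    return mask

# Example: 3 blocks of 2 tokens each
print(skip_causal_attention_mask(3, 2).int())
```

Under these assumptions, the cached keys and values of already-generated clean blocks can be reused at inference time, which is what allows the alternation between denoising a new block and autoregressive decoding with a KV-Cache.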
Submission Type: Regular submission (no more than 12 pages of main content)
Assigned Action Editor: ~Jaesik_Park3
Submission Number: 6529