Keywords: Masked Autoregressive Tokenization, Autoregressive Image Generation, Vocabulary-enriched Visual Tokenization
Abstract: We propose MaDiT, a Masked autoregressive Detokenization Transformer for visual reconstruction and generation. It formulates visual tokenization as a flow-matching problem: the model learns a mapping from a standard normal distribution to the distribution of image data, conditioned on discrete visual and text tokens as well as intermediate autoregressive context. The effectiveness of MaDiT stems from two core designs. First, a masked autoencoder (MAE) fuses multi-modal cues from vocabulary priors and partially unmasked visual patterns to produce discrete visual tokens imbued with semantic meaning. This mitigates the ambiguity and information loss that plague vanilla vector-quantized (VQ) representations. Second, we introduce a masked autoregressive detokenization pipeline that reconstructs images in a low- to high-frequency fashion. By initially focusing on flat, low-frequency regions and progressively refining higher-frequency details, our model reconstructs images with significantly improved fidelity. Within this pipeline, a masked decoder generates context-rich embeddings, conditioning a dedicated velocity field for precise final reconstruction. Extensive experiments show that MaDiT outperforms mainstream VQ tokenizers and enables high-fidelity visual generation on top of existing LLMs.
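The abstract does not state the training objective explicitly; a minimal sketch, assuming the standard conditional flow-matching (rectified-flow) form that the description suggests, would train the velocity field $v_\theta$ along a linear interpolation path:

\[
x_t = (1 - t)\, x_0 + t\, x_1, \qquad x_0 \sim \mathcal{N}(0, I),\quad x_1 \sim p_{\mathrm{data}},
\]
\[
\mathcal{L}_{\mathrm{FM}}(\theta) = \mathbb{E}_{t,\, x_0,\, x_1}\!\left[\, \big\| v_\theta(x_t, t \mid c) - (x_1 - x_0) \big\|_2^2 \,\right],
\]

where the condition $c$ bundles the discrete visual and text tokens with the masked decoder's context-rich embeddings. The uniform time sampling $t \sim \mathcal{U}[0, 1]$ and the linear path are assumptions here; the paper may use a different probability path or time schedule.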
Primary Area: generative models
Submission Number: 10506