Keywords: Masked Autoregressive Tokenization, Autoregressive Image Generation, Vocabulary-enriched Visual Tokenization
Abstract: We propose MaDiT, a Masked autoregressive Detokenization Transformer for visual reconstruction and generation. It formulates visual tokenization as a flow-matching problem: the model learns a mapping from a standard normal distribution to the distribution of image data, conditioned on discrete visual and text tokens as well as intermediate autoregressive context. The effectiveness of MaDiT stems from two core designs. First, a masked autoencoder (MAE) fuses multi-modal cues from vocabulary priors and partially unmasked visual patterns to produce discrete visual tokens imbued with semantic meaning. This mitigates the ambiguity and information loss that plague vanilla vector-quantized (VQ) representations. Second, we introduce a masked autoregressive detokenization pipeline that reconstructs images in a low- to high-frequency fashion. By initially focusing on flat, low-frequency regions and progressively refining higher-frequency details, our model reconstructs images with significantly improved fidelity. Within this pipeline, a masked decoder generates context-rich embeddings, conditioning a dedicated velocity field for precise final reconstruction. Extensive experiments show that MaDiT outperforms mainstream VQ tokenizers and enables high-fidelity visual generation on top of existing LLMs.
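The abstract does not state the training objective explicitly; a minimal sketch, assuming the standard conditional flow-matching (rectified-flow) form that the description suggests, would train the velocity field $v_\theta$ along a linear interpolation path:

\[
x_t = (1 - t)\, x_0 + t\, x_1, \qquad x_0 \sim \mathcal{N}(0, I),\quad x_1 \sim p_{\mathrm{data}},
\]
\[
\mathcal{L}_{\mathrm{FM}}(\theta) = \mathbb{E}_{t,\, x_0,\, x_1}\!\left[\, \big\| v_\theta(x_t, t \mid c) - (x_1 - x_0) \big\|_2^2 \,\right],
\]

where the condition $c$ bundles the discrete visual and text tokens with the masked decoder's context-rich embeddings. The uniform time sampling $t \sim \mathcal{U}[0, 1]$ and the linear path are assumptions here; the paper may use a different probability path or time schedule.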
Primary Area: generative models
Submission Number: 10506