Keywords: ARGen-Dexion, Autoregressive, Image Generation, Decoder
Abstract: Autoregressive models (ARGen) have emerged as a cornerstone of image generation within multimodal large language models (MLLMs), yet their visual outputs remain underwhelming. Conventional remedies, such as scaling AR models or re-engineering their architectures, yield diminishing returns at high cost, straining infrastructure without resolving the core limitation. In this work, we challenge this status quo, arguing that vision decoders should shoulder a greater share of the image-synthesis burden, relieving autoregressive models of an undue load. We present ARGen-Dexion, a systematic overhaul of the vision decoder that improves autoregressive image generation without modifying pre-trained AR models or visual encoders. Our approach delivers gains through three innovations: (1) a scaled, fine-tuned decoder that achieves substantially higher reconstruction fidelity; (2) a bi-directional Transformer-based token refiner that infuses global context into the AR model's outputs, overcoming the constraints inherent in causal inference; and (3) a resolution-aware training strategy that enables seamless multi-resolution and multi-aspect-ratio synthesis. Extensive scaling studies yield insights into decoder design that challenge long-held assumptions. Empirically, ARGen-Dexion boosts LlamaGen by 9% VQAScore on the GenAI-Benchmark and can be applied to various next-token-prediction MLLMs. This work motivates a rethinking of the interplay between MLLMs and vision decoders, paving the way for unified, efficient, and visually superior multimodal systems.
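The bi-directional token refiner described in innovation (2) can be illustrated with a minimal sketch: a non-causal Transformer that lets every AR-produced token attend to every other token before the vision decoder renders pixels, in contrast to the AR model's strictly left-to-right inference. The class name, dimensions, and layer counts below are illustrative assumptions, not the paper's actual configuration.

```python
import torch
import torch.nn as nn

class TokenRefiner(nn.Module):
    """Hypothetical sketch of a bi-directional token refiner: a non-causal
    Transformer that re-contextualizes AR-generated token embeddings with
    global attention before they reach the vision decoder."""

    def __init__(self, dim: int = 256, heads: int = 4, layers: int = 2):
        super().__init__()
        layer = nn.TransformerEncoderLayer(
            d_model=dim, nhead=heads, dim_feedforward=4 * dim,
            batch_first=True, norm_first=True)
        # No causal mask is applied: each token attends bidirectionally,
        # injecting the global context the AR model could not see.
        self.encoder = nn.TransformerEncoder(layer, num_layers=layers)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (batch, seq_len, dim); output has the same shape.
        return self.encoder(tokens)

refiner = TokenRefiner()
ar_tokens = torch.randn(1, 16 * 16, 256)  # e.g. a 16x16 grid of AR tokens
refined = refiner(ar_tokens)
print(refined.shape)  # torch.Size([1, 256, 256])
```

Because the refiner sits between the frozen AR model and the decoder, it can be trained without touching either pre-trained component, which is the key constraint the abstract emphasizes.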
Primary Area: generative models
Submission Number: 8459