Abstract: Autoregressive models have emerged as a powerful generative paradigm for visual generation. The current de facto standard of next-token prediction commonly operates over a single-scale sequence of dense image tokens and cannot exploit global context, especially when predicting early tokens. In this paper, we introduce a new autoregressive design that models a hierarchy from a few low-resolution image tokens to the typical dense image tokens, and we explore the hierarchical dependency across multi-scale image tokens in depth. Technically, we present Hierarchical Masked Autoregressive models (Hi-MAR), which pivot on low-resolution image tokens to trigger hierarchical autoregressive modeling in a multi-phase manner. In the first phase, Hi-MAR learns to predict a few low-resolution image tokens, which function as intermediary pivots that reflect global structure. These pivots act as additional guidance that strengthens the next autoregressive modeling phase by shaping the global structural awareness of the typical dense image tokens. A new Diffusion Transformer head is further devised to amplify the global context among all tokens for mask token prediction. Extensive evaluations on both class-conditional and text-to-image generation tasks demonstrate that Hi-MAR outperforms typical AR baselines while requiring lower computational cost.
Lay Summary: Autoregressive models are a new type of AI system for generating images. They create images piece by piece, like assembling a puzzle. However, this process often starts without a clear idea of the full picture, which can lead to images that are locally correct but globally inconsistent. We propose a new approach that helps the AI start with a rough sketch of the image: a low-resolution version that captures the global layout. This sketch acts as a guide, allowing the model to fill in the fine details in a more coherent and structured way. Our method, called Hi-MAR, builds the image in multiple stages, first focusing on the overall structure and then refining the finer details. We also design a new component that improves the model’s ability to maintain global consistency. This leads to better image quality at lower computational cost. Our work improves how AI models generate images, making them more efficient and reliable for applications like art, design, and virtual content creation.
Link To Code: https://github.com/HiDream-ai/himar
Primary Area: Applications->Computer Vision
Keywords: Masked Autoregressive Models, Image Generation
Submission Number: 10026