HieraQuery: Bridging Multimodal Understanding and High-Quality Generation through Multi-Scale Query Learning
Keywords: Unified multimodal models, Diffusion Models
Abstract: Unified multi-modal LLMs enable the integration of visual understanding and generation in a single framework. Recent studies show that a set of learnable queries can serve as an effective interface between autoregressive multimodal LLMs and diffusion models, yet the visual quality of the generated images still lags behind that of dedicated generation models. The major bottleneck is the difficulty for a single set of learnable queries to produce accurate visual representations in a single round of inference. Hence, we introduce HieraQuery, which leverages a hierarchy of learnable visual queries to generate high-quality visual content in a coarse-to-fine manner. Specifically, several sets of learnable queries are provided to the language model: the preceding sets are used to generate images at lower resolutions, capturing the global structure of the generated content, while the subsequent sets serve as conditions for generating higher-resolution images, concentrating on fine-grained details. In addition, a multi-scale representation alignment strategy is proposed to enforce cross-scale consistency and accelerate convergence. Ablation analyses demonstrate that hierarchical visual queries effectively improve the visual generation capability of unified multi-modal LLMs, and that scaling up the number of scales is an effective way to further improve generation quality.
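The abstract describes the mechanism only at a high level; the sketch below is a minimal, hypothetical PyTorch illustration of the coarse-to-fine query idea, not the authors' implementation. All names (HieraQueryModel, query_sets, multi_scale_alignment_loss) and design details (number of scales, queries per scale, the cosine-based alignment objective, and the assumed Hugging Face-style llm interface returning last_hidden_state) are assumptions for illustration.

```python
# Hypothetical sketch of hierarchical visual queries as described in the abstract.
# Class/function names and the alignment objective are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class HieraQueryModel(nn.Module):
    def __init__(self, llm, hidden_dim, scales=(64, 128, 256), queries_per_scale=64):
        super().__init__()
        self.llm = llm  # assumed autoregressive multimodal LLM backbone
        self.scales = scales
        # One learnable query set per resolution scale (coarse -> fine).
        self.query_sets = nn.ParameterList([
            nn.Parameter(torch.randn(queries_per_scale, hidden_dim) * 0.02)
            for _ in scales
        ])

    def forward(self, text_embeds):
        # Append all query sets after the text tokens and run one LLM pass.
        b, t, _ = text_embeds.shape
        queries = torch.cat(
            [q.unsqueeze(0).expand(b, -1, -1) for q in self.query_sets], dim=1
        )
        hidden = self.llm(
            inputs_embeds=torch.cat([text_embeds, queries], dim=1)
        ).last_hidden_state
        # Split the output states back into per-scale query representations.
        n = self.query_sets[0].size(0)
        out = hidden[:, t:, :]
        return [out[:, i * n:(i + 1) * n, :] for i in range(len(self.scales))]

def multi_scale_alignment_loss(scale_feats):
    # One plausible instantiation of cross-scale consistency: pool each scale's
    # query states and align adjacent scales with a cosine objective.
    loss = 0.0
    for coarse, fine in zip(scale_feats[:-1], scale_feats[1:]):
        loss = loss + (1 - F.cosine_similarity(
            coarse.mean(dim=1), fine.mean(dim=1), dim=-1)).mean()
    return loss
```

In such a setup, each scale's query representations would condition a diffusion decoder at the corresponding resolution, with the coarser output (or its latent) providing additional context for the next, finer-scale generation step; the exact conditioning and alignment details in HieraQuery may differ.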
Primary Area: foundation or frontier models, including LLMs
Submission Number: 11877