HieraQuery: Bridging Multimodal Understanding and High-Quality Generation through Multi-Scale Query Learning
Keywords: Unified multimodal models, Diffusion Models
Abstract: Unified multi-modal LLMs enable the integration of visual understanding and generation in a single framework. Recent studies show that a set of learnable queries can serve as an effective interface between autoregressive multimodal LLMs and diffusion models, yet the visual quality of the generated images still lags behind that of dedicated generation models. The major bottleneck is the difficulty for a single set of learnable queries to produce accurate visual representations in a single round of inference. Hence, we introduce HieraQuery, which leverages a hierarchy of learnable visual queries to generate high-quality visual content in a coarse-to-fine manner. Specifically, several sets of learnable queries are provided to the language model: the preceding sets are used to generate images at lower resolutions, capturing the global structure of the generated content, while the subsequent sets serve as conditions for generating higher-resolution images, concentrating on fine-grained details. In addition, a multi-scale representation alignment strategy is proposed to enforce cross-scale consistency and accelerate convergence. Ablation analyses demonstrate that hierarchical visual queries effectively improve the visual generation capability of unified multi-modal LLMs, and that scaling up the number of scales is an effective way to further improve generation quality.
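The abstract describes the mechanism only at a high level; the sketch below is a minimal, hypothetical PyTorch illustration of the coarse-to-fine query idea, not the authors' implementation. All names (HieraQueryModel, query_sets, multi_scale_alignment_loss) and design details (number of scales, queries per scale, the cosine-based alignment objective, and the assumed Hugging Face-style llm interface returning last_hidden_state) are assumptions for illustration.

```python
# Hypothetical sketch of hierarchical visual queries as described in the abstract.
# Class/function names and the alignment objective are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class HieraQueryModel(nn.Module):
    def __init__(self, llm, hidden_dim, scales=(64, 128, 256), queries_per_scale=64):
        super().__init__()
        self.llm = llm  # assumed autoregressive multimodal LLM backbone
        self.scales = scales
        # One learnable query set per resolution scale (coarse -> fine).
        self.query_sets = nn.ParameterList([
            nn.Parameter(torch.randn(queries_per_scale, hidden_dim) * 0.02)
            for _ in scales
        ])

    def forward(self, text_embeds):
        # Append all query sets after the text tokens and run one LLM pass.
        b, t, _ = text_embeds.shape
        queries = torch.cat(
            [q.unsqueeze(0).expand(b, -1, -1) for q in self.query_sets], dim=1
        )
        hidden = self.llm(
            inputs_embeds=torch.cat([text_embeds, queries], dim=1)
        ).last_hidden_state
        # Split the output states back into per-scale query representations.
        n = self.query_sets[0].size(0)
        out = hidden[:, t:, :]
        return [out[:, i * n:(i + 1) * n, :] for i in range(len(self.scales))]

def multi_scale_alignment_loss(scale_feats):
    # One plausible instantiation of cross-scale consistency: pool each scale's
    # query states and align adjacent scales with a cosine objective.
    loss = 0.0
    for coarse, fine in zip(scale_feats[:-1], scale_feats[1:]):
        loss = loss + (1 - F.cosine_similarity(
            coarse.mean(dim=1), fine.mean(dim=1), dim=-1)).mean()
    return loss
```

In such a setup, each scale's query representations would condition a diffusion decoder at the corresponding resolution, with the coarser output (or its latent) providing additional context for the next, finer-scale generation step; the exact conditioning and alignment details in HieraQuery may differ.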
Primary Area: foundation or frontier models, including LLMs
Submission Number: 11877