Keywords: Unified Multimodal Understanding and Generation
Abstract: The remarkable success of the autoregressive paradigm has driven significant advances in Multimodal Large Language Models (MLLMs), with powerful models such as Show-o, Transfusion, and Emu3 making notable strides in unified image understanding and generation. For the first time, we uncover a common phenomenon: the understanding capability of MLLMs is typically stronger than their generative capability, with a significant gap between the two. Building on this insight, we propose HermesFlow, a simple and general framework designed to seamlessly bridge the gap between understanding and generation in MLLMs. Specifically, we take homologous data as input to curate homologous preference data for both understanding and generation. Through Pair-DPO and self-play iterative optimization, HermesFlow effectively aligns multimodal understanding and generation using this homologous preference data. Extensive experiments demonstrate the significant superiority of our approach over prior methods, particularly in narrowing the gap between multimodal understanding and generation. These findings highlight the potential of HermesFlow as a general alignment framework for next-generation multimodal foundation models.
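As context for the Pair-DPO objective named in the abstract (a hedged sketch under stated assumptions, not a formulation taken from the submission): assuming Pair-DPO applies the standard DPO loss jointly to understanding and generation preference pairs curated from the same homologous input, it could take the form

$$\mathcal{L}_{\text{Pair-DPO}}(\theta) = \mathcal{L}_{\text{DPO}}^{\text{und}}(\theta) + \mathcal{L}_{\text{DPO}}^{\text{gen}}(\theta),$$

where each term is the standard DPO loss

$$\mathcal{L}_{\text{DPO}}(\theta) = -\,\mathbb{E}_{(x,\,y_w,\,y_l)}\!\left[\log \sigma\!\left(\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\text{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\text{ref}}(y_l \mid x)}\right)\right],$$

with $\pi_{\text{ref}}$ a frozen reference model, $\beta$ the DPO temperature, and $(y_w, y_l)$ preferred/dispreferred responses, text answers for the understanding term and images for the generation term. The exact term weighting and the self-play iteration schedule are details of the paper itself and are not assumed here.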
Supplementary Material: zip
Primary Area: Deep learning (e.g., architectures, generative models, optimization for deep networks, foundation models, LLMs)
Submission Number: 12428