Forge-and-Quench: Enhancing Image Generation for Higher Fidelity in Unified Multimodal Models

Yanbing Zeng; Jia Wang; Mahanghang; Junqiang Wu; Jie Zhu; Xiaoming Wei; Jie Hu

16 Sept 2025 (modified: 11 Feb 2026). Submitted to ICLR 2026. License: CC BY 4.0

Keywords: Unified Multimodal Models, Understanding Enhances Generation, Diffusion Models, Adapters

Abstract: Integrating image generation and understanding into a single framework has become a pivotal goal in the multimodal domain. However, how understanding can effectively assist generation has not been fully explored. Unlike previous works that focus on leveraging reasoning abilities and world knowledge from understanding models, this paper introduces a novel perspective: leveraging understanding to enhance the fidelity and detail richness of generated images. To this end, we propose Forge-and-Quench, a new unified framework that puts this principle into practice. In the generation process of our framework, an MLLM first reasons over the entire conversational context, including text instructions, to produce an enhanced text instruction. This refined instruction is then mapped to a virtual visual representation, termed the Bridge Feature, via a novel Bridge Adapter. This feature acts as a crucial link, forging insights from the understanding model to quench and refine the generation process. It is subsequently fed into a T2I backbone, which uses the enhanced instruction as textual input. To further explore the core advantage of this paradigm, we conduct comprehensive studies on the Bridge Feature and Bridge Adapter. Our framework demonstrates exceptional extensibility and flexibility, enabling efficient migration across different MLLM and T2I models with significant savings in training overhead, all without compromising the MLLM's inherent multimodal understanding capabilities. Experiments show that Forge-and-Quench significantly improves image fidelity and detail across multiple models, while also maintaining instruction-following accuracy and enhancing world knowledge application.
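
The generation flow described in the abstract can be sketched as a three-stage pipeline: the MLLM rewrites the conversational context into an enhanced instruction, the Bridge Adapter maps that instruction to a Bridge Feature, and the T2I backbone receives both. The code below is only an illustrative sketch of that data flow, not the authors' implementation. Every name in it (`forge_and_quench`, `GenerationResult`, `toy_mllm`, `toy_bridge_adapter`, `toy_t2i`) is hypothetical, and the toy stand-ins replace real models with trivial string and list operations so the example runs without any dependencies.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence

Vector = List[float]


@dataclass
class GenerationResult:
    enhanced_instruction: str
    bridge_feature: Vector
    image: str  # placeholder for real image data


def forge_and_quench(
    context: Sequence[str],
    mllm: Callable[[Sequence[str]], str],
    bridge_adapter: Callable[[str], Vector],
    t2i: Callable[[str, Vector], str],
) -> GenerationResult:
    """Chain the three stages named in the abstract (hypothetical interfaces)."""
    # 1. The MLLM reasons over the whole context and emits an enhanced instruction.
    enhanced = mllm(context)
    # 2. The Bridge Adapter maps the instruction to a virtual visual representation.
    feature = bridge_adapter(enhanced)
    # 3. The T2I backbone takes the enhanced instruction as text input plus the feature.
    image = t2i(enhanced, feature)
    return GenerationResult(enhanced, feature, image)


# Toy stand-ins (assumptions for illustration only, not real models).
def toy_mllm(context: Sequence[str]) -> str:
    return " ".join(context) + ", highly detailed"


def toy_bridge_adapter(text: str, dim: int = 4) -> Vector:
    # Fold character codes into a fixed-size vector to mimic a feature embedding.
    feature = [0.0] * dim
    for i, ch in enumerate(text):
        feature[i % dim] += ord(ch)
    scale = max(len(text), 1)
    return [v / scale for v in feature]


def toy_t2i(text: str, feature: Vector) -> str:
    return f"<image text={text!r} feature_dim={len(feature)}>"


result = forge_and_quench(["a red fox"], toy_mllm, toy_bridge_adapter, toy_t2i)
print(result.enhanced_instruction)
print(result.image)
```

Passing the three stages as plain callables mirrors the abstract's claim that the framework can migrate across different MLLM and T2I models: in this sketch, swapping a component only means passing a different function with the same signature.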

Primary Area: generative models

Submission Number: 7014
