Keywords: Multimodal LLM, Image Composition Recommendation
Abstract: Compositing an object into a given image is a common task in image editing, requiring both creative ideation and technical precision to achieve a harmonious result. Professionals start by brainstorming concepts, then scale and position elements so that they integrate seamlessly into the image. While recent diffusion models have made significant progress in pixel-level harmonization, suggesting suitable concepts and recommending compatible composition locations and scales remain less explored and challenging tasks. In this work, we leverage the advanced reasoning capabilities of Multimodal Large Language Models (MLLMs) to address these challenges. We first propose a data pipeline that automatically generates diverse, high-quality, large-scale training data from an internet-scale stock image collection. Using this dataset, we fine-tune MLLMs with enhanced projector designs and targeted data augmentation to achieve robust content recommendation and precise object placement, demonstrating strong performance compared with prior methods. Our model supports flexible input options, either image or text, alongside user-defined placement control, offering designers a new level of creative flexibility. Finally, we showcase the model's impact in real-world editing workflows, where it consistently achieves state-of-the-art performance on image composition benchmarks, including Composition1K, our self-created in-the-wild evaluation dataset. The code and the Composition1K dataset are provided at https://anonymous.4open.science/r/MGCR.
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 2485