Keywords: Multimodal LLM, Image Composition Recommendation
Abstract: Compositing an object into a given image is a common task in image editing, requiring both creative ideation and technical precision to achieve a harmonious result. Professionals start by brainstorming concepts, then scale and position elements so that they integrate seamlessly into the image. While recent diffusion models have made significant progress in pixel-level harmonization, suggesting suitable concepts and recommending compatible composition locations and scales remain less explored and challenging tasks. In this work, we leverage the advanced reasoning capabilities of Multimodal Large Language Models (MLLMs) to address these challenges. We first propose a data pipeline that automatically generates diverse, high-quality, large-scale training data from an internet-scale stock image collection. Using this dataset, we fine-tune MLLMs with enhanced projector designs and targeted data augmentation to achieve robust content recommendation and precise object placement, demonstrating strong performance compared with prior methods. Our model supports flexible input options, either image or text, alongside user-defined placement control, offering designers a new level of creative flexibility. Finally, we showcase the model's impact in real-world editing workflows, where it consistently achieves state-of-the-art performance on image composition benchmarks, including Composition1K, our self-created in-the-wild evaluation dataset. The code and the Composition1K dataset are provided at https://anonymous.4open.science/r/MGCR.
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 2485