SpatialEdit: Unlocking the Spatial Capability in Multimodal Large Language Model Driven Image Editing

27 Sept 2024 (modified: 05 Feb 2025) | Submitted to ICLR 2025 | CC BY 4.0
Keywords: Image Editing, Multimodal LLM
TL;DR: MLLM-driven image editing methods perform poorly on prompts that require spatial information. We theoretically analyze possible reasons for this phenomenon and propose the SpatialEdit framework to address it.
Abstract: Current instruction-guided image editing methods generally assume that incorporating a powerful Multimodal Large Language Model (MLLM) significantly enhances the understanding of complex instructions, thereby improving editing outcomes and generalization. However, even with a powerful MLLM such as GPT4V, results are disappointing when instructions involve simple spatial information, such as "change the clothes color of the leftmost person to red". Our theoretical analysis suggests that both the training strategy and the manner in which models are aggregated in the current paradigm may contribute to unsatisfactory spatial image editing capabilities. Consequently, we propose the SpatialEdit framework, featuring a two-stage training approach and a novel data engine in which questions and instructions are enriched with spatial information. Further theoretical analysis of our method indicates that it improves proficiency in both spatial editing and general image editing tasks. We also create a benchmark to evaluate spatial editing ability. In zero-shot image editing experiments on various datasets, our method achieves SOTA results on several key metrics.
Supplementary Material: zip
Primary Area: generative models
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 9211