Abstract: Image editing in daily life often requires models to first understand the user's intention and then carry out the edit. Despite significant advances in image editing technology, understanding and executing complex instructions remains a substantial challenge. Existing image editing models either fail to comprehend complex intentions or make errors when handling multiple objects. To address these challenges, we present an image editing framework that leverages the Chain-of-Thought (CoT) reasoning and localization capabilities of multimodal Large Language Models (LLMs) to help diffusion models generate more refined images. We carefully design a CoT process comprising instruction decomposition, region localization, and detailed description, and we train our model to produce both this CoT process and the mask of the edited region. By supplying diffusion models with the generated prompts and masks, our model edits images with a superior understanding of instructions. Extensive experiments demonstrate that our model outperforms existing state-of-the-art models in image generation both qualitatively and quantitatively. Notably, our model exhibits an enhanced ability to understand complex prompts and generate the corresponding images.
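To make the described pipeline concrete, below is a minimal sketch of how an MLLM-generated CoT (instruction decomposition, region localization, detailed description) and edit mask could condition an off-the-shelf inpainting diffusion model. The MLLM interface (`run_cot_mllm`, `CoTResult`) is hypothetical and stands in for the authors' trained model; only the `diffusers` inpainting call is a real API.

```python
# Hedged sketch: a CoT-producing MLLM conditions a diffusion inpainting model.
# `run_cot_mllm` and `CoTResult` are illustrative placeholders, not the paper's API.
from dataclasses import dataclass
from PIL import Image
import torch
from diffusers import StableDiffusionInpaintPipeline


@dataclass
class CoTResult:
    sub_instructions: list[str]           # step 1: instruction decomposition
    region_boxes: list[tuple]             # step 2: region localization (x0, y0, x1, y1)
    detailed_prompt: str                  # step 3: detailed description for the diffusion model
    mask: Image.Image                     # predicted binary mask of the region to edit


def run_cot_mllm(image: Image.Image, instruction: str) -> CoTResult:
    """Placeholder for the trained multimodal LLM, assumed to emit the
    CoT reasoning chain and an edit mask as described in the abstract."""
    raise NotImplementedError


def edit_image(image: Image.Image, instruction: str) -> Image.Image:
    cot = run_cot_mllm(image, instruction)
    pipe = StableDiffusionInpaintPipeline.from_pretrained(
        "runwayml/stable-diffusion-inpainting",
        torch_dtype=torch.float16,
    ).to("cuda")
    # The generated mask restricts edits to the localized region, and the
    # generated detailed prompt replaces the raw user instruction.
    result = pipe(prompt=cot.detailed_prompt, image=image, mask_image=cot.mask)
    return result.images[0]
```

The design intuition, as stated in the abstract, is that the mask confines the diffusion model's changes to the localized objects while the detailed description disambiguates a complex instruction before generation.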