Abstract: Recent works have explored text-guided image editing with diffusion models, generating edited images from text prompts. However, these models often struggle to accurately locate the regions to be edited and to faithfully perform precise edits. In this work, we propose a framework, termed InstructEdit, that performs fine-grained editing based on user instructions. The framework has three components: a language processor, a segmenter, and an image editor. The first component, the language processor, uses a large language model to parse the user instruction and output a segmentation prompt for the segmenter and captions for the image editor; we adopt ChatGPT and optionally BLIP2 for this step. The second component, the segmenter, takes the segmentation prompt from the language processor and uses a state-of-the-art segmentation framework, Grounded Segment Anything, to automatically generate a high-quality mask. The third component, the image editor, uses the captions from the language processor and the mask from the segmenter to compute the edited image; we adopt Stable Diffusion together with the mask-guided generation from DiffEdit for this purpose. Experiments show that our method outperforms previous editing methods in fine-grained editing applications where the input image contains a complex object or multiple objects. We improve the mask quality over DiffEdit and thereby improve the quality of the edited images. We also show that our framework can be combined with NeRF or video editing pipelines to achieve fine-grained NeRF or video editing.
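To make the three-stage pipeline in the abstract concrete, the following is a minimal sketch of the data flow (instruction → prompts/captions → mask → edited image). The helper names (parse_instruction, ground_and_segment, masked_edit) and the ParsedInstruction fields are hypothetical placeholders for the ChatGPT, Grounded Segment Anything, and Stable Diffusion/DiffEdit components, not the authors' actual API.

```python
# Sketch of the InstructEdit pipeline described in the abstract.
# All helpers below are hypothetical stubs standing in for external models.
from dataclasses import dataclass
from typing import Any


@dataclass
class ParsedInstruction:
    segmentation_prompt: str  # phrase naming the region to edit (for the segmenter)
    source_caption: str       # caption of the original image (for the image editor)
    target_caption: str       # caption describing the desired edit (for the image editor)


def parse_instruction(instruction: str) -> ParsedInstruction:
    """Language processor: an LLM (e.g. ChatGPT, optionally with BLIP2 captions)
    splits the user instruction into a segmentation prompt and captions. Stub only."""
    raise NotImplementedError("call the LLM here")


def ground_and_segment(image: Any, prompt: str) -> Any:
    """Segmenter: Grounded Segment Anything produces a binary mask for the
    region named by the segmentation prompt. Stub only."""
    raise NotImplementedError("call Grounded SAM here")


def masked_edit(image: Any, mask: Any, source: str, target: str) -> Any:
    """Image editor: Stable Diffusion with DiffEdit-style mask-guided generation
    rewrites only the masked region, guided by the captions. Stub only."""
    raise NotImplementedError("call the diffusion editor here")


def instruct_edit(image: Any, instruction: str) -> Any:
    """End-to-end flow: instruction -> prompts/captions -> mask -> edited image."""
    parsed = parse_instruction(instruction)
    mask = ground_and_segment(image, parsed.segmentation_prompt)
    return masked_edit(image, mask, parsed.source_caption, parsed.target_caption)
```

The sketch only illustrates how the language processor's outputs feed the segmenter and editor; the actual components and interfaces are described in the paper.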
Submission Length: Regular submission (no more than 12 pages of main content)
Changes Since Last Submission: - We added a teaser in Figure 1 in the main paper.
- We provided the project webpage in the supplementary material.
- We added additional NeRF editing results using NeRF-Art in Figure 12.
- We added testing examples from InstructPix2Pix in Figure 13 in the appendix.
- We added the results of more (difficult) editing applications in Figure 14 in the appendix.
- We included a broader impact section in Section B in the appendix.
- We changed Section 2.3 to Section 3 in the main paper. We also added an overall description of the method at the beginning of Section 3.
Assigned Action Editor: ~Yu-Xiong_Wang1
Submission Number: 2749