Focus Where It Matters: LLM-Guided Regional Identification for Instruction-based Image Editing

Yong Man Ro

Published: 30 Sept 2025, Last Modified: 16 Apr 2026 | ACM MM | CC BY-NC 4.0

Abstract: Instruction-based image editing enables intuitive modification of images through natural-language descriptions. However, existing models often fail to accurately identify the target region, i.e., the area that should be modified. As a result, unintended changes can occur in non-target areas, where the original image should remain unchanged. To address this issue, we propose FoRE, a framework guided by a multimodal large language model (MLLM) that identifies the target region from the given edit instruction and performs image editing using region-aware embeddings. Within FoRE, the Region-guided Edit Adapter projects these embeddings from the MLLM embedding space into the diffusion conditioning space. The Region-guided Refinement Module then refines the projected features to improve their spatial accuracy before they guide the diffusion process. Through comprehensive evaluations, we show that FoRE significantly improves localization accuracy and instruction fidelity compared with existing approaches. By explicitly incorporating region-aware conditioning, our framework bridges the gap between instruction comprehension and spatially precise image modification, advancing the capabilities of instruction-based image editing.
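The abstract describes a data flow (MLLM region-aware embeddings, then the Region-guided Edit Adapter, then the Region-guided Refinement Module, then diffusion conditioning) but gives no layer details. The following is a minimal NumPy sketch of that flow under stated assumptions. The two-layer MLP adapter, the single residual self-attention refinement block, the projection-free cross-attention, the embedding widths, the token counts, and every name (`EditAdapter`, `RefinementModule`, `cross_attention`, `D_MLLM`, `D_COND`) are hypothetical illustrations, not FoRE's actual implementation.

```python
import numpy as np

# HYPOTHETICAL sketch: all names, widths, and layer choices are assumptions;
# the abstract does not specify FoRE's internal architecture.
D_MLLM = 64  # assumed width of the MLLM's region-aware embeddings
D_COND = 32  # assumed width of the diffusion model's conditioning tokens

rng = np.random.default_rng(0)


def gelu(x):
    # tanh approximation of GELU
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))


def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def layer_norm(x, eps=1e-5):
    # parameter-free layer norm over the feature axis
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)


class EditAdapter:
    """Hypothetical stand-in for the Region-guided Edit Adapter:
    a two-layer MLP mapping MLLM embeddings into the conditioning space."""

    def __init__(self, d_in, d_out):
        self.w1 = rng.normal(0.0, d_in**-0.5, (d_in, d_out))
        self.w2 = rng.normal(0.0, d_out**-0.5, (d_out, d_out))

    def __call__(self, h):  # h: (num_tokens, d_in)
        return gelu(h @ self.w1) @ self.w2  # -> (num_tokens, d_out)


class RefinementModule:
    """Hypothetical stand-in for the Region-guided Refinement Module:
    one residual self-attention block over the projected tokens."""

    def __init__(self, d):
        self.wq, self.wk, self.wv = (rng.normal(0.0, d**-0.5, (d, d)) for _ in range(3))

    def __call__(self, c):  # c: (num_tokens, d)
        x = layer_norm(c)
        q, k, v = x @ self.wq, x @ self.wk, x @ self.wv
        attn = softmax(q @ k.T / np.sqrt(c.shape[-1]))  # (num_tokens, num_tokens)
        return c + attn @ v  # residual: refinement corrects, not replaces, the adapter output


def cross_attention(latent, cond):
    """Image latent tokens (queries) attend to conditioning tokens (keys/values),
    a common way conditions enter latent diffusion models; learned Q/K/V
    projections are omitted here for brevity."""
    attn = softmax(latent @ cond.T / np.sqrt(cond.shape[-1]))  # (num_latent, num_cond)
    return latent + attn @ cond


region_embeddings = rng.normal(size=(8, D_MLLM))    # 8 assumed region-aware tokens
latent_tokens = rng.normal(size=(16 * 16, D_COND))  # flattened 16x16 latent grid

adapter = EditAdapter(D_MLLM, D_COND)
refiner = RefinementModule(D_COND)

cond = refiner(adapter(region_embeddings))
out = cross_attention(latent_tokens, cond)
print(cond.shape, out.shape)  # (8, 32) (256, 32)
```

Randomly initialized weights stand in for trained parameters, so the sketch only demonstrates shapes and data flow. The residual connection in the refinement block is one plausible design choice: it lets the refinement act as a correction on top of the adapter output rather than overwriting it.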