Keywords: text-guided image editing, multimodal large language models, instruction following, local editing
TL;DR: We present GIE-Bench, a benchmark for local text-guided image editing that disentangles functional correctness and content preservation with VQA-style and mask-based evaluation.
Abstract: Editing images using natural language instructions has become a natural and expressive way to modify visual content; yet, evaluating the performance of such models remains challenging. Existing evaluation approaches often rely on image-text similarity metrics like CLIP, which lack precision. In this work, we introduce a new benchmark designed to evaluate text-guided image editing models in a more grounded manner, along two critical dimensions: functional correctness, assessed via automatically generated multiple-choice questions that verify whether the intended change was successfully applied; and image content preservation, which ensures that non-targeted regions of the image remain visually consistent, using an object-aware masking technique and preservation scoring. The benchmark includes over 1000 high-quality editing examples across 20 diverse content categories, each annotated with detailed editing instructions, evaluation questions, and spatial object masks. We note that our benchmark does not cover global editing tasks such as full style transfer, which remain important but are outside our current scope. We conduct a large-scale study comparing GPT-Image-1, the latest flagship in the text-guided image editing space, against several state-of-the-art editing models, and validate our automatic metrics against human ratings. Results show that GPT-Image-1 leads in instruction-following accuracy, but often over-modifies irrelevant image regions, highlighting a key trade-off in current model behavior. GIE-Bench provides a scalable, reproducible framework for advancing more accurate evaluation of text-guided image editing.
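To make the preservation dimension concrete, below is a minimal sketch of a mask-based preservation score: given the original image, the edited image, and an object mask marking the region the instruction targets, it measures how much the pixels outside that mask changed. The function name, the pixel-difference scoring rule, and the array conventions are illustrative assumptions, not the benchmark's actual metric.

```python
import numpy as np

def preservation_score(original: np.ndarray, edited: np.ndarray, object_mask: np.ndarray) -> float:
    """Illustrative mask-based preservation score (assumed formulation).

    original, edited: HxWx3 float arrays in [0, 1].
    object_mask: HxW boolean array, True where the edit is allowed.
    Returns a value in [0, 1]; 1.0 means pixels outside the mask are unchanged.
    """
    keep = ~object_mask                      # region that should stay untouched
    if keep.sum() == 0:                      # degenerate case: the whole image is editable
        return 1.0
    diff = np.abs(original - edited)[keep]   # per-pixel absolute error outside the mask
    return float(1.0 - diff.mean())          # map mean error to a preservation score

# Example usage with a dummy 64x64 image and a square edit region.
if __name__ == "__main__":
    rng = np.random.default_rng(0)
    img = rng.random((64, 64, 3))
    edit = img.copy()
    mask = np.zeros((64, 64), dtype=bool)
    mask[16:48, 16:48] = True                # the object the instruction targets
    edit[mask] = rng.random((32, 32, 3))     # simulate an edit confined to the mask
    print(preservation_score(img, edit, mask))  # ~1.0, since outside-mask pixels match
```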
Supplementary Material: zip
Primary Area: generative models
Submission Number: 3492