Co-EditBench: Human-Aligned Benchmark for Instruction-Based Image Editing with Multi-Dimensional Assessment

15 Sept 2025 (modified: 11 Feb 2026) · Submitted to ICLR 2026 · CC BY 4.0
Keywords: Image Editing; Benchmark
TL;DR: We present Co-EditBench, a benchmark for image editing with over 1,000 image-instruction pairs, 11 evaluation dimensions, and a novel automatic evaluation pipeline.
Abstract: Multimodal large language models (MLLMs) have made significant progress in instruction-guided image editing; however, evaluating them comprehensively and in alignment with human judgment remains a considerable challenge. Existing benchmarks exhibit clear limitations, including restricted editing types, limited evaluation dimensions, coarse perception of image details, and systematic deviation from subjective aesthetics. To overcome these issues, we propose Co-EditBench, a more comprehensive benchmark for human-aligned evaluation. First, we construct a diagnostic dataset via crowd-sourcing, obtaining high-resolution, real-world image-instruction pairs that cover 16 editing types. Then, to enable fine-grained and consistent assessment, we define 11 novel evaluation dimensions that dissect “AI artifacts” into traceable visual pathologies. Additionally, we propose Co-EditEval, a comprehensive automated evaluation pipeline that leverages multi-dimensional evaluators and a meticulously designed Chain of Thought for contextualized visual reasoning. Extensive experiments demonstrate that Co-EditBench provides a more reliable and nuanced evaluation than existing benchmarks, achieving significantly higher correlation with human judgments.
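As a rough illustration of the kind of pipeline the abstract describes, the sketch below scores a single edit along several dimensions by sending a chain-of-thought prompt to an MLLM judge. The dimension names, the prompt wording, the 1-to-5 scale, and the `query_mllm` stub are all hypothetical assumptions for illustration; they are not Co-EditEval's actual implementation.

```python
# Hypothetical sketch of a multi-dimensional, CoT-based image-edit evaluator.
# All names below (dimensions, prompt text, query_mllm) are assumptions made
# for illustration, not the paper's pipeline.

# A plausible subset of evaluation dimensions; the paper defines 11.
DIMENSIONS = ["instruction_compliance", "identity_preservation", "artifact_severity"]

COT_TEMPLATE = (
    "You are evaluating an instruction-based image edit.\n"
    "Instruction: {instruction}\n"
    "Dimension: {dimension}\n"
    "First, describe the regions of the source and edited images relevant to "
    "this dimension. Then reason step by step about how well the edit "
    "satisfies it. Finally, output a score from 1 (worst) to 5 (best) on a "
    "line of the form 'SCORE: <n>'."
)

def query_mllm(prompt: str, images: list[bytes]) -> str:
    """Placeholder for a call to a multimodal LLM API (an assumption here)."""
    raise NotImplementedError

def evaluate_edit(instruction: str, source: bytes, edited: bytes) -> dict[str, int]:
    """Score one edit along every dimension via chain-of-thought prompting."""
    scores: dict[str, int] = {}
    for dim in DIMENSIONS:
        prompt = COT_TEMPLATE.format(instruction=instruction, dimension=dim)
        reply = query_mllm(prompt, images=[source, edited])
        # Parse the final 'SCORE: <n>' line that the CoT reply is asked to emit.
        score_line = [ln for ln in reply.splitlines() if ln.startswith("SCORE:")][-1]
        scores[dim] = int(score_line.split(":", 1)[1].strip())
    return scores
```

Per-dimension scores produced this way could then be aggregated or correlated with human ratings, which is how a benchmark of this sort would typically validate its automatic evaluator against human judgment.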
Primary Area: datasets and benchmarks
Submission Number: 6037