Keywords: Diffusion, Image editing
TL;DR: GPT-IMAGE-EDIT-1.5M is a unified, million-scale dataset for instruction-guided image editing, demonstrating through systematic experiments that token-wise conditioning and T5 embeddings significantly enhance robustness and editing performance.
Abstract: Recent advances in proprietary multimodal models such as GPT-Image-1 have set new standards for high-fidelity, instruction-guided image editing. However, their closed-source nature restricts open research and reproducibility. To bridge this gap, we introduce GPT-IMAGE-EDIT-1.5M, a publicly available dataset of over 1.5 million high-quality editing triplets systematically unified from OmniEdit, HQEdit, and UltraEdit. Our data-curation pipeline leverages output regeneration and instruction rewriting to substantially improve instruction following (IF) and perceptual quality (PQ), while relying only on simple geometric and instruction-level filters. We benchmark three MMDiT diffusion architectures: SD3 InstructPix2Pix (channel-wise conditioning), Flux with SigLIP (token-wise conditioning), and FluxKontext (token-wise conditioning), analyzing their robustness to identity preservation (IP) degradation. Our results indicate that token-wise conditioning consistently outperforms channel-wise conditioning. For evaluation transparency, we explicitly note when reported results use thinking-rewritten prompts. We further examine text encoders under a common frozen-encoder protocol, finding that T5 embeddings consistently meet or exceed multimodal large language model (MLLM) embeddings, particularly for longer prompts; simple linear or query-based integration, however, offers limited gains, suggesting that deeper cross-modal fusion may be necessary. Fine-tuning FluxKontext on GPT-IMAGE-EDIT-1.5M achieves open-source performance competitive with GPT-Image-1 (7.66 on GEdit-EN and 3.90 on ImgEdit-Full, with thinking-rewritten prompts; 8.97 on Complex-Edit). Our findings highlight critical interactions among instruction complexity, semantic alignment, and identity preservation, informing future directions for open-source image editing.
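To make the conditioning distinction in the abstract concrete, below is a minimal PyTorch sketch of the two schemes being compared: channel-wise conditioning concatenates the source-image latent with the noisy latent along the channel axis (InstructPix2Pix-style), while token-wise conditioning appends patchified source tokens to the noisy token sequence (Kontext-style), letting attention mediate the interaction. All shapes, variable names, and the `patchify` helper are illustrative assumptions, not code from the paper.

```python
import torch

# Illustrative latent shapes (assumed, not the paper's actual dimensions).
B, C, H, W = 2, 16, 32, 32
noisy = torch.randn(B, C, H, W)  # noisy target latent at the current step
cond = torch.randn(B, C, H, W)   # clean source-image latent

# Channel-wise conditioning (InstructPix2Pix-style): concatenate along the
# channel axis, so the model's input projection must widen from C to 2*C.
channel_wise_input = torch.cat([noisy, cond], dim=1)  # (B, 2C, H, W)

def patchify(x: torch.Tensor) -> torch.Tensor:
    """Flatten a latent into a token sequence of shape (B, H*W, C)."""
    return x.flatten(2).transpose(1, 2)

# Token-wise conditioning (Kontext-style): append source tokens to the noisy
# token sequence; self-attention then fuses the two, with no widened input
# projection required.
token_wise_input = torch.cat(
    [patchify(noisy), patchify(cond)], dim=1
)  # (B, 2*H*W, C)

print(channel_wise_input.shape)  # torch.Size([2, 32, 32, 32])
print(token_wise_input.shape)    # torch.Size([2, 2048, 16])
```

One way to read the abstract's finding is architectural: the token-wise route keeps source-image information available at every attention layer rather than only through the first projection, which is consistent with its reported robustness advantage for identity preservation.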
Supplementary Material: zip
Primary Area: datasets and benchmarks
Submission Number: 13534