GPT-IMAGE-EDIT-1.5M: A Million-Scale, GPT-Generated Image Dataset

ICLR 2026 Conference Submission 13534 Authors

18 Sept 2025 (modified: 08 Oct 2025), ICLR 2026 Conference Submission, CC BY 4.0
Keywords: Diffusion, Image editing
TL;DR: GPT-IMAGE-EDIT-1.5M is a unified, million-scale dataset for instruction-guided image editing, demonstrating through systematic experiments that token-wise conditioning and T5 embeddings significantly enhance robustness and editing performance.
Abstract: Recent advancements in proprietary multimodal models such as GPT-Image-1 have set new standards for high-fidelity, instruction-guided image editing. However, their closed-source nature restricts open research and reproducibility. To bridge this gap, we introduce GPT-IMAGE-EDIT-1.5M, a publicly available dataset comprising over 1.5 million high-quality editing triplets systematically unified from OmniEdit, HQEdit, and UltraEdit. Our data curation pipeline leverages output regeneration and instruction rewriting to substantially improve instruction following (IF) and perceptual quality (PQ), while intentionally preserving the identity preservation (IP) challenges typical of GPT-generated images. We benchmark three MMDiT diffusion architectures, SD3 InstructPix2Pix (channel-wise conditioning), Flux with SigLIP (token-wise conditioning), and FluxKontext (token-wise conditioning), to analyze their robustness to IP degradation. Our results indicate that token-wise conditioning consistently outperforms channel-wise conditioning. For evaluation transparency, we explicitly state when results involve thinking-rewritten prompts. Moreover, we examine text encoders in a common frozen-encoder setting, demonstrating that T5 embeddings consistently match or exceed multimodal large language model (MLLM) embeddings, particularly for longer prompts. Simple linear or query-based integration methods, however, offer limited improvements, indicating that deeper cross-modal fusion may be necessary. Fine-tuning FluxKontext on GPT-IMAGE-EDIT-1.5M achieves open-source performance competitive with GPT-Image-1 (7.66@GEdit-EN and 3.90@ImgEdit-Full, with thinking-rewritten prompts; 8.97@Complex-Edit). Our findings highlight critical interactions among instruction complexity, semantic alignment, and identity preservation, informing future directions in open-source image editing.
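To make the two conditioning schemes concrete, here is a minimal PyTorch sketch, not the paper's implementation: all module names, shapes, and hyperparameters below are illustrative assumptions. It contrasts channel-wise concatenation of the source latent (InstructPix2Pix-style) with token-wise concatenation along the sequence axis (Flux/FluxKontext-style).

```python
# Minimal sketch (assumed shapes and modules, not the released code)
# contrasting the two conditioning schemes benchmarked in the paper.
import torch
import torch.nn as nn

B, C, H, W = 2, 4, 32, 32      # latent batch: channels x spatial
D = 1024                       # model width

noisy_latent = torch.randn(B, C, H, W)   # latent being denoised
source_latent = torch.randn(B, C, H, W)  # encoded source image

# Channel-wise conditioning (InstructPix2Pix-style): the source latent is
# concatenated along the channel axis, so the patchify layer must accept
# 2*C input channels.
channelwise_in = torch.cat([noisy_latent, source_latent], dim=1)          # (B, 2C, H, W)
patchify_2c = nn.Conv2d(2 * C, D, kernel_size=2, stride=2)
x_channelwise = patchify_2c(channelwise_in).flatten(2).transpose(1, 2)    # (B, N, D)

# Token-wise conditioning (Flux/FluxKontext-style): the source latent is
# embedded as extra tokens and concatenated along the sequence axis, so
# attention can relate edit tokens to source tokens directly.
patchify_c = nn.Conv2d(C, D, kernel_size=2, stride=2)
noisy_tokens = patchify_c(noisy_latent).flatten(2).transpose(1, 2)        # (B, N, D)
source_tokens = patchify_c(source_latent).flatten(2).transpose(1, 2)      # (B, N, D)
x_tokenwise = torch.cat([noisy_tokens, source_tokens], dim=1)             # (B, 2N, D)

print(x_channelwise.shape, x_tokenwise.shape)
```

One plausible reading of the abstract's finding is visible in the shapes: token-wise conditioning keeps the source image as its own token sequence that attention can reference directly, whereas channel-wise conditioning fuses source and noisy latents before patchification.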
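Similarly, the "simple linear or query-based integration" of frozen text-encoder embeddings that the abstract finds limited can be sketched as follows; again, all dimensions and module choices are assumptions for illustration, not the authors' configuration.

```python
# Minimal sketch (assumed interfaces) of two simple text-embedding
# integration schemes: a per-token linear projection vs. a small set of
# learned queries that cross-attend into the frozen embeddings.
import torch
import torch.nn as nn

B, L, D_enc, D_model = 2, 128, 4096, 1024  # batch, prompt length, encoder/model widths
text_emb = torch.randn(B, L, D_enc)        # frozen T5 (or MLLM) token embeddings

# (a) Linear integration: project every token into the diffusion model width.
linear_proj = nn.Linear(D_enc, D_model)
cond_linear = linear_proj(text_emb)        # (B, L, D_model)

# (b) Query-based integration: a fixed number of learned queries pool the
# prompt via cross-attention (Q-Former / Perceiver-resampler style).
num_queries = 32
queries = nn.Parameter(torch.randn(1, num_queries, D_model))
kv_proj = nn.Linear(D_enc, D_model)
attn = nn.MultiheadAttention(D_model, num_heads=8, batch_first=True)
kv = kv_proj(text_emb)                                    # (B, L, D_model)
cond_query, _ = attn(queries.expand(B, -1, -1), kv, kv)   # (B, num_queries, D_model)

print(cond_linear.shape, cond_query.shape)
```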
Supplementary Material: zip
Primary Area: datasets and benchmarks
Submission Number: 13534