OpenGPT-4o-Image: A Comprehensive Dataset for Advanced Image Generation and Editing

Published: 30 Apr 2026, Last Modified: 24 Jun 2026
Venue: ICML 2026 regular
License: CC BY 4.0
Abstract: The performance of unified multimodal models for image generation and editing is fundamentally constrained by the quality and comprehensiveness of their training data. While existing datasets have covered basic tasks like style transfer and simple object manipulation, they often lack the systematic structure and challenging scenarios required for real-world applications. To address this bottleneck, we introduce OpenGPT-4o-Image, a large-scale dataset constructed using a novel methodology that combines a hierarchical task taxonomy with automated data generation. Our taxonomy not only includes fundamental capabilities such as text rendering and style control but also introduces highly practical yet challenging categories like scientific imagery for physics/chemistry illustrations and complex instruction editing requiring simultaneous execution of multiple operations. Through an automated pipeline leveraging structured resource pools and GPT-4o, we generate 80k high-quality instruction-image pairs with controlled diversity, covering 11 major domains and 51 subtasks. Extensive experiments show that fine-tuning leading models on our dataset yields significant performance gains across multiple benchmarks, with improvements of up to 18% on editing tasks (UniWorld-V1 on ImgEdit-Bench) and 13% on generation tasks (Harmon on GenEval). Our work demonstrates that systematic data construction is key to advancing multimodal AI capabilities.
Lay Summary: AI systems that create and edit images from text are only as good as the examples they learn from. Existing training datasets cover common tasks such as changing an image’s style or adding or removing simple objects, but they often miss harder situations that people need in real applications, such as drawing scientific diagrams or following instructions that require several edits at once. We address this problem by building OpenGPT-4o-Image, a large dataset designed to teach image AI systems a broader and more organized set of skills. We first created a detailed map of image-generation and editing tasks, covering 11 major areas and 51 smaller task types. Using this structure, we built an automated process with GPT-4o to generate 80,000 high-quality instruction-and-image examples with diverse and carefully controlled content. When leading image AI models were trained on our dataset, they performed much better on standard tests, improving by up to 18% on image editing and 13% on image generation. This shows that carefully designed training data is a key step toward making multimodal AI systems more useful, reliable, and capable in real-world scenarios.
Link To Code: https://github.com/NROwind/OpenGPT-4o-Image
Primary Area: Applications->Computer Vision
Keywords: Generation, Editing, Dataset, Unified MLLMs
Flagged For Ethics Review: true
Submission Number: 23215