Keywords: dataset, compositional image generation, diffusion model
Abstract: Despite their success in generating high-quality images, text-to-image (T2I) models struggle to generate compositional scenes with multiple objects and their intricate relationships. We attribute this issue to limitations in existing image-text pair datasets, which provide only free-form prompts and lack precise inter-object relationship annotations. To resolve this, we construct LAION-Comp, a large-scale dataset of 540K+ aesthetic images structurally annotated with detailed scene graphs that explicitly encode multiple objects, their attributes, and intricate relations. The annotation pipeline employs a large vision-language model followed by partial human verification. Using LAION-Comp, we train four baseline models on diffusion and flow-matching backbones augmented with a dedicated scene graph encoder. For rigorous evaluation, we introduce CompSGen Bench, a benchmark with 20,838 test samples designed to systematically evaluate complex compositions. Experiments show that the four models trained on LAION-Comp outperform their original prompt-only counterparts as well as advanced scene-graph-based methods on both our new benchmark and existing compositional benchmarks. Furthermore, the learned structural conditioning naturally enables fine-grained, object-level image editing, demonstrating its potential as an effective editing interface. Our work validates the advantages of explicit structural annotation and contributes a foundational resource to the community for advancing controllable and compositional image synthesis.
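To make the dataset structure concrete, here is a minimal sketch of what a scene-graph annotation with objects, attributes, and relations might look like. All field names and the record layout are assumptions for illustration, not LAION-Comp's actual schema:

```python
# Hypothetical scene-graph annotation record, as described in the abstract:
# objects with attributes, plus inter-object relations. Field names are
# assumptions, not the paper's published format.
annotation = {
    "image_id": "laion_comp_000001",
    "caption": "a black cat sitting on a red sofa next to a lamp",
    "objects": [
        {"id": 0, "label": "cat",  "attributes": ["black"]},
        {"id": 1, "label": "sofa", "attributes": ["red"]},
        {"id": 2, "label": "lamp", "attributes": []},
    ],
    # Relations as (subject_id, predicate, object_id) triples,
    # giving the explicit inter-object structure that prompt-only
    # datasets lack.
    "relations": [
        (0, "sitting on", 1),
        (1, "next to", 2),
    ],
}
```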
Supplementary Material: zip
Primary Area: datasets and benchmarks
Submission Number: 5794