SRA-SD: A Lightweight Framework for Structure-Guided Compositional Image Synthesis
Keywords: structural knowledge augmented; text-to-image generation
Abstract: Diffusion models have demonstrated remarkable capabilities in text-to-image generation. However, they often fail to faithfully reflect the details specified in the text, omitting objects or rendering them with mismatched attributes or incorrect spatial locations. To address this problem, we propose SRA-SD, a lightweight structure-aware framework that enhances generation fidelity by explicitly modeling both spatial relations and attribute bindings. Our method introduces two complementary modules: (1) a spatial relation enhancement module that extracts relational triples via a large language model and encodes them into heterogeneous semantic graphs, enriching the text representation with structural layout knowledge through graph neural networks; and (2) an attribute enhancement module that enforces fine-grained object-attribute alignment via contrastive cross-attention learning, using syntactically derived positive pairs and semantically plausible negative samples. To better evaluate both capabilities, we introduce SRA-Bench, a new benchmark that jointly assesses spatial reasoning and attribute binding. Experiments on three datasets show that SRA-SD significantly improves generation accuracy with minimal parameter overhead, outperforming existing methods in complex, compositional scenarios.
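The contrastive object-attribute alignment described in module (2) can be illustrated with a minimal InfoNCE-style loss. This is a hedged sketch, not the paper's implementation: the function name `info_nce_loss`, the embedding dimensionality, the temperature value, and the use of cosine similarity are all assumptions; the actual objective, embedding spaces, and negative-sampling strategy are defined in the paper itself.

```python
import numpy as np

def info_nce_loss(anchor, positive, negatives, temperature=0.1):
    """InfoNCE-style contrastive loss for one object-attribute pair (sketch).

    anchor:    embedding of an object token (e.g. "ball")
    positive:  embedding of its syntactically bound attribute (e.g. "red")
    negatives: list of embeddings of mismatched attributes (e.g. attributes
               bound to other objects in the prompt)
    """
    def cos(a, b):
        # Cosine similarity between two vectors.
        return float(a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))

    # Similarity of the anchor to the positive and to each negative,
    # scaled by the temperature.
    logits = np.array(
        [cos(anchor, positive)] + [cos(anchor, n) for n in negatives]
    ) / temperature

    # Softmax cross-entropy with the positive pair as the target class,
    # shifted for numerical stability.
    logits -= logits.max()
    probs = np.exp(logits) / np.exp(logits).sum()
    return -np.log(probs[0])

# Toy usage: a well-aligned pair yields a near-zero loss, a misaligned
# pair a large one.
obj = np.array([1.0, 0.0])
good_attr = np.array([1.0, 0.0])
bad_attr = np.array([0.0, 1.0])
low = info_nce_loss(obj, good_attr, [bad_attr])
high = info_nce_loss(obj, bad_attr, [good_attr])
```

Minimizing this loss pulls each object embedding toward its true attribute and pushes it away from attributes belonging to other objects, which is the intended effect of the contrastive cross-attention learning described above.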
Primary Area: generative models
Submission Number: 7144