Abstract: In this paper, we introduce a novel framework for joint scene graph and image generation, a challenging task involving high-dimensional, multi-modal structured data. To effectively model this complex joint distribution, we adopt a factorized approach: first generating a scene graph, followed by image generation conditioned on the generated scene graph. While conditional image generation has been widely explored in the literature, our primary focus is on the unconditional generation of scene graphs from noise, which provides efficient and interpretable control over the image generation process. This task requires generating plausible scene graphs with heterogeneous attributes for both nodes (objects) and edges (relations between objects), encompassing continuous attributes (e.g., object bounding boxes) and discrete attributes (e.g., object and relation categories). To address this challenge, we introduce DiffuseSG, a novel diffusion model that jointly models the heterogeneous node and edge attributes. We explore different encoding strategies to effectively handle the categorical data. Leveraging a graph transformer as the denoiser, DiffuseSG progressively refines scene graph representations in a continuous space before discretizing them to generate structured outputs. Additionally, we introduce an IoU-based regularization term to enhance empirical performance. Our model outperforms existing methods in scene graph generation on the Visual Genome and COCO-Stuff datasets, excelling in both standard and newly introduced metrics that more accurately capture the task's complexity. Furthermore, we demonstrate the broader applicability of DiffuseSG in two important downstream tasks: (1) achieving superior results in a range of scene graph completion tasks, and (2) enhancing scene graph detection models by leveraging additional training samples generated by DiffuseSG.
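To make the "IoU-based regularization term" mentioned in the abstract more concrete, here is a minimal sketch of what such a bounding-box regularizer could look like in PyTorch. This is an illustrative assumption, not the submission's implementation: the box format (normalized x1, y1, x2, y2), the `iou_regularizer` function, and the `lambda_iou` weighting are all hypothetical names introduced here for clarity; the paper's exact formulation may differ.

```python
# Hypothetical sketch of an IoU-style bounding-box regularizer for a diffusion
# denoiser (assumed form; not the authors' exact loss).
# Boxes are assumed normalized, format (x1, y1, x2, y2), shape (N, 4).
import torch


def iou_regularizer(pred_boxes: torch.Tensor, target_boxes: torch.Tensor) -> torch.Tensor:
    """Penalize low overlap between predicted (denoised) and clean boxes as mean(1 - IoU)."""
    # Intersection rectangle coordinates.
    x1 = torch.maximum(pred_boxes[:, 0], target_boxes[:, 0])
    y1 = torch.maximum(pred_boxes[:, 1], target_boxes[:, 1])
    x2 = torch.minimum(pred_boxes[:, 2], target_boxes[:, 2])
    y2 = torch.minimum(pred_boxes[:, 3], target_boxes[:, 3])
    inter = (x2 - x1).clamp(min=0) * (y2 - y1).clamp(min=0)

    # Individual box areas (clamped in case the denoiser outputs degenerate boxes).
    area_pred = (pred_boxes[:, 2] - pred_boxes[:, 0]).clamp(min=0) * \
                (pred_boxes[:, 3] - pred_boxes[:, 1]).clamp(min=0)
    area_tgt = (target_boxes[:, 2] - target_boxes[:, 0]).clamp(min=0) * \
               (target_boxes[:, 3] - target_boxes[:, 1]).clamp(min=0)

    union = area_pred + area_tgt - inter
    iou = inter / union.clamp(min=1e-6)
    return (1.0 - iou).mean()


# Hypothetical usage: add the regularizer to the standard denoising objective.
# total_loss = denoising_loss + lambda_iou * iou_regularizer(x0_pred_boxes, x0_boxes)
```

The design intuition, under these assumptions, is that the usual per-coordinate denoising loss treats box coordinates independently, whereas an IoU-style term directly rewards geometric overlap of the predicted object boxes with the ground truth.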
Submission Length: Long submission (more than 12 pages of main content)
Assigned Action Editor: ~Xuming_He3
Submission Number: 4438