Joint Generative Modeling of Grounded Scene Graphs and Images via Diffusion Models

TMLR Paper 4438 Authors

10 Mar 2025 (modified: 28 May 2025) · Under review for TMLR · CC BY 4.0
Abstract: A grounded scene graph represents a visual scene as a graph, where nodes denote objects (including labels and spatial locations) and directed edges encode relations among them. In this paper, we introduce a novel framework for joint grounded scene graph and image generation, a challenging task involving high-dimensional, multi-modal structured data. To effectively model this complex joint distribution, we adopt a factorized approach: we first generate a grounded scene graph, then generate an image conditioned on it. While conditional image generation has been widely explored in the literature, our primary focus is on the unconditional generation of grounded scene graphs from noise, which provides efficient and interpretable control over the image generation process. This task requires generating plausible grounded scene graphs with heterogeneous attributes for both nodes (objects) and edges (relations between objects), encompassing continuous attributes (e.g., object bounding boxes) and discrete attributes (e.g., object and relation categories). To address this challenge, we introduce DiffuseSG, a novel diffusion model that jointly models the heterogeneous node and edge attributes. We explore different encoding strategies to effectively handle the categorical data. Leveraging a graph transformer as the denoiser, DiffuseSG progressively refines grounded scene graph representations in a continuous space before discretizing them to generate structured outputs. Additionally, we introduce an IoU-based regularization term to enhance empirical performance. Our model outperforms existing methods in grounded scene graph generation on the Visual Genome and COCO-Stuff datasets, excelling in both standard and newly introduced metrics that more accurately capture the task's complexity. Furthermore, we demonstrate the broader applicability of DiffuseSG in two important downstream tasks: (1) achieving superior results in a range of grounded scene graph completion tasks, and (2) enhancing grounded scene graph detection models by leveraging additional training samples generated by DiffuseSG.
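To make the abstract's two most concrete ideas tangible, the sketch below illustrates (a) discretizing a continuous scene-graph representation into node labels, bounding boxes, and relation categories, and (b) an IoU-based box regularizer. This is a minimal reading of the abstract, not the paper's implementation: the tensor layout, the (cx, cy, w, h) box parameterization, and the assumption that the regularizer is a matched-pair 1 - IoU penalty between denoised and ground-truth boxes are all our own assumptions for illustration.

```python
import torch


def boxes_to_xyxy(b):
    # Assumed box format: (cx, cy, w, h), normalized to [0, 1].
    cx, cy, w, h = b.unbind(-1)
    return torch.stack([cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2], dim=-1)


def pairwise_iou(a, b):
    # a: (N, 4), b: (M, 4) in xyxy format; returns an (N, M) IoU matrix.
    lt = torch.max(a[:, None, :2], b[None, :, :2])
    rb = torch.min(a[:, None, 2:], b[None, :, 2:])
    wh = (rb - lt).clamp(min=0)
    inter = wh[..., 0] * wh[..., 1]
    area_a = (a[:, 2] - a[:, 0]).clamp(min=0) * (a[:, 3] - a[:, 1]).clamp(min=0)
    area_b = (b[:, 2] - b[:, 0]).clamp(min=0) * (b[:, 3] - b[:, 1]).clamp(min=0)
    union = area_a[:, None] + area_b[None, :] - inter
    return inter / union.clamp(min=1e-6)


def iou_regularizer(pred_boxes, gt_boxes):
    # Hypothetical auxiliary loss: penalize 1 - IoU between each denoised box
    # and its ground-truth counterpart (diagonal of the pairwise IoU matrix).
    iou = pairwise_iou(boxes_to_xyxy(pred_boxes), boxes_to_xyxy(gt_boxes))
    return (1.0 - iou.diagonal()).mean()


def discretize_scene_graph(node_feats, edge_feats, num_node_classes):
    # node_feats: (N, num_node_classes + 4) -> class scores concatenated with box params.
    # edge_feats: (N, N, num_edge_classes) -> relation scores per ordered node pair.
    node_labels = node_feats[:, :num_node_classes].argmax(-1)
    boxes = node_feats[:, num_node_classes:].clamp(0, 1)
    relations = edge_feats.argmax(-1)
    return node_labels, boxes, relations
```

In this reading, the denoiser operates entirely on the continuous node and edge tensors, and discretization is applied only once, after the final refinement step, to produce the structured scene graph.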
Submission Length: Long submission (more than 12 pages of main content)
Assigned Action Editor: ~Xuming_He3
Submission Number: 4438