Struct2Real: A Systematic Framework for Accurate and Efficient Structure-Grounded Object Image Generation

03 Sept 2025 (modified: 11 Feb 2026). Submitted to ICLR 2026. License: CC BY 4.0
Keywords: Controllable Image Generation, Topology and Spatial Layout Constraints, Multimodal Large Language Models
TL;DR: This paper introduces Struct2Real, a framework that generates realistic, structurally faithful object images from StructMap, a 3D structural representation built from geometric primitives and their spatial layouts.
Abstract: Recent advances in image generation have enabled the creation of high-quality visual content with impressive semantic fidelity. However, generating object images under fine-grained structural constraints, particularly while preserving topology and spatial layout, remains an open challenge. We propose Struct2Real, a novel framework for structure-grounded object image generation that combines explicit structural control with photorealistic generation. It has two components. 1) We develop a novel structure modeling system that enables users to create a 3D structural representation named StructMap, an object structure abstraction composed of geometric primitives and their spatial layouts. 2) We design a modular image generation algorithm and combine it with multimodal large language models (MLLMs), harnessing their superior performance to generate realistic object images under the structural constraints encoded in StructMap. Extensive experiments demonstrate that Struct2Real achieves strong performance in structure-grounded object image generation while requiring little user effort, highlighting the practicality and effectiveness of our method. Please refer to our appendix and supplementary material for more details.
Supplementary Material: zip
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 1508