RelScene: A Benchmark and Baseline for Spatial Relations in Text-Driven 3D Scene Generation

Published: 20 Jul 2024, Last Modified: 21 Jul 2024 · MM 2024 Poster · CC BY 4.0
Abstract: Text-driven 3D indoor scene generation aims to automatically generate and arrange objects into a 3D scene that accurately captures the semantics of a given text description. Recent works have shown the potential to generate 3D scenes guided by specific object categories and room layouts, but they lack a robust mechanism for keeping spatial relationships consistent with the provided text description during generation. Moreover, annotating the objects and relationships of 3D scenes is time-consuming and costly, so such annotations are not easily obtained for model training. In this paper, we therefore construct a dataset and benchmark for assessing spatial relations in text-driven 3D scene generation. It contains a comprehensive collection of 3D scenes with textual descriptions, annotates object spatial relations, and provides both template-based and free-form natural language descriptions. We also present a pseudo description feature generation method to handle 3D scenes without language annotations. We design a latent space that aligns spatial relations in 3D scenes with text descriptions, from which we can sample features according to the spatial relation for few-shot learning. Finally, we propose new metrics that measure how well an approach generates the correct spatial relationships among objects.
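To make the evaluation idea concrete, below is a minimal sketch of how a spatial-relation correctness metric of the kind the abstract describes could be scored. This is not the paper's actual metric: the `Box3D` representation, the relation predicates, the `relation_accuracy` function, and the contact tolerance are all illustrative assumptions. It checks each text-derived (subject, relation, object) triple against simple geometric predicates over axis-aligned bounding boxes of the generated objects.

```python
from dataclasses import dataclass

@dataclass
class Box3D:
    # Center coordinates and half-extents of an axis-aligned bounding box
    # (x: left/right, y: front/back, z: up/down). Hypothetical representation.
    cx: float; cy: float; cz: float
    hx: float; hy: float; hz: float

def relation_holds(a: Box3D, b: Box3D, relation: str) -> bool:
    """Check whether box `a` stands in `relation` to box `b`
    using simple separating-interval tests on each axis."""
    if relation == "left of":
        return a.cx + a.hx <= b.cx - b.hx
    if relation == "right of":
        return a.cx - a.hx >= b.cx + b.hx
    if relation == "in front of":
        return a.cy + a.hy <= b.cy - b.hy
    if relation == "behind":
        return a.cy - a.hy >= b.cy + b.hy
    if relation == "on top of":
        # Bottom face of `a` nearly touches top face of `b` (assumed tolerance).
        return abs((a.cz - a.hz) - (b.cz + b.hz)) < 0.05
    raise ValueError(f"unknown relation: {relation}")

def relation_accuracy(scene: dict[str, Box3D],
                      triples: list[tuple[str, str, str]]) -> float:
    """Fraction of text-specified (subject, relation, object) triples
    satisfied by the generated scene; triples whose objects are missing
    from the scene count as violations."""
    hits = 0
    for subj, rel, obj in triples:
        if subj in scene and obj in scene and relation_holds(scene[subj], scene[obj], rel):
            hits += 1
    return hits / len(triples) if triples else 1.0

# Example: scoring "the chair is left of the table" on a toy scene.
scene = {"chair": Box3D(0.0, 0.0, 0.4, 0.3, 0.3, 0.4),
         "table": Box3D(1.5, 0.0, 0.4, 0.5, 0.5, 0.4)}
print(relation_accuracy(scene, [("chair", "left of", "table")]))  # 1.0
```

A per-triple predicate score like this complements scene-level realism metrics: it directly penalizes generations whose object arrangement contradicts the text, which global distribution metrics do not capture.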
Primary Subject Area: [Generation] Generative Multimedia
Secondary Subject Area: [Generation] Social Aspects of Generative AI, [Generation] Multimedia Foundation Models, [Experience] Multimedia Applications
Relevance To Conference: This submission falls within the scope of Generative Multimedia, which requires multimedia systems to produce content with realism and diversity. Text-driven 3D indoor scene generation aims to automatically generate and arrange objects into a 3D scene that accurately captures the semantics of a given text description. In this paper, we propose a new benchmark for assessing spatial relations in text-driven 3D scene generation. Furthermore, we propose a simple yet effective baseline method for 3D scene generation that maintains spatial relationships consistent with the provided text description.
Supplementary Material: zip
Submission Number: 5342