Abstract: Indoor scene modification has emerged as a prominent area within computer vision, particularly for its applications in Augmented Reality (AR) and Virtual Reality (VR). Traditional methods often rely on pre-existing object databases and predetermined object positions, limiting their flexibility and adaptability to new scenarios. To address this challenge, we present a novel end-to-end multi-modal deep neural network that generates point cloud objects seamlessly integrated with their surroundings, driven by textual instructions. Our model enables the creation of new environments with previously unseen object layouts, eliminating the need for pre-stored CAD models. Leveraging Point-E as our generative model, we introduce quantized position prediction and Top-K estimation to address the false negatives that result from ambiguous language descriptions. Furthermore, we conduct comprehensive evaluations of the diversity of the generated objects, the efficacy of textual instructions, and quantitative metrics, affirming the realism and versatility of our model in generating indoor objects. For a holistic assessment, we additionally adopt visual grounding as a metric of the quality and coherence of the scenes our model produces. Through these advancements, our approach not only advances the state of the art in indoor scene modification but also lays a foundation for future work on immersive computing and digital environment creation.
Primary Subject Area: [Experience] Multimedia Applications
Secondary Subject Area: [Content] Multimodal Fusion
Relevance To Conference: In summary, the contributions of our work are as follows:
* We construct new datasets for scene modification tasks by designing a GPT-aided data pipeline that paraphrases the descriptive texts in the Referit3D dataset into generative instructions, yielding the Nr3D_SA and Sr3D_SA datasets (a minimal prompt sketch is given after this list). The datasets will be released publicly and can be used for comparable tasks in future studies.
* We propose an end-to-end multi-modal diffusion-based deep neural network for generating indoor 3D objects within specific scenes according to input instructions.
* We propose quantized position prediction, a simple but effective technique that predicts the Top-K candidate positions, mitigating the false-negative problem arising from the ambiguity of language and providing reasonable placement options (see the second sketch after this list).
* We introduce the visual grounding task as an evaluation strategy to assess the quality of a generated scene and integrate several metrics to evaluate the generated objects.
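To make the GPT-aided paraphrasing step concrete, below is a minimal sketch, assuming the `openai` Python client (>=1.0); the prompt wording, model name, and function name are illustrative assumptions, not the actual pipeline used for Nr3D_SA/Sr3D_SA.

```python
from openai import OpenAI  # assumes the openai>=1.0 Python client and an API key in the environment

client = OpenAI()

# Hypothetical prompt: rewrite a Referit3D-style referring description of an
# existing object as an instruction to generate and place such an object.
PROMPT = (
    "Rewrite the following description of an object in a 3D indoor scene "
    "as an instruction to generate and place such an object.\n"
    "Description: {utterance}\n"
    "Instruction:"
)

def paraphrase_to_instruction(utterance: str, model: str = "gpt-4o-mini") -> str:
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": PROMPT.format(utterance=utterance)}],
        temperature=0.7,
    )
    return resp.choices[0].message.content.strip()

# e.g. "the chair closest to the window" might become
#      "add a chair next to the window"
```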
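Likewise, a minimal sketch of quantized position prediction with Top-K selection, assuming per-cell scores over a quantized floor-plan grid; the grid size, room extent, and function name are illustrative assumptions rather than the paper's implementation.

```python
import torch

def topk_quantized_positions(logits, k=5, grid_size=32, room_extent=(6.0, 6.0)):
    """Select Top-K candidate placements from per-cell logits.

    logits: (grid_size * grid_size,) scores over a quantized floor-plan grid.
    Returns (k, 2) xy coordinates of the K highest-scoring cell centers and their scores.
    """
    scores, idx = torch.topk(logits, k)                      # K best grid cells
    rows = torch.div(idx, grid_size, rounding_mode="floor")  # recover 2D cell indices
    cols = idx % grid_size
    cell_w = room_extent[0] / grid_size
    cell_h = room_extent[1] / grid_size
    xs = (cols.float() + 0.5) * cell_w                       # cell centers, not corners
    ys = (rows.float() + 0.5) * cell_h
    return torch.stack([xs, ys], dim=-1), scores

# Example: K candidate positions from random scores over a 32x32 grid
candidates, scores = topk_quantized_positions(torch.randn(32 * 32), k=5)
```

Returning several plausible positions rather than a single argmax is what lets the model cope with instructions that admit more than one valid placement.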
Supplementary Material: zip
Submission Number: 5686