In Situ 3D Scene Synthesis for Ubiquitous Embodied Interfaces

Published: 20 Jul 2024, Last Modified: 21 Jul 2024 · MM2024 Poster · CC BY 4.0
Abstract: Virtual reality (VR) provides an interface for accessing virtual environments anytime and anywhere, allowing us to experience and interact with an immersive virtual world. It has been widely used in various fields, such as entertainment, training, and education. However, the user's body cannot be separated from the physical world: while immersed in a virtual scene, users face safety and immersion issues caused by physical objects in the surrounding environment. Although virtual scene synthesis has attracted widespread attention, many popular methods are limited to generating purely virtual scenes independent of the physical environment, or simply map all physical objects as obstacles. To this end, we propose a scene agent that synthesizes situated 3D virtual scenes as a ubiquitous embodied interface in VR. The scene agent synthesizes scenes by perceiving the user's physical environment and inferring the user's demands. The synthesized scenes maintain the affordances of the physical environment, enabling immersed users to interact with physical objects and improving their sense of security; they also maintain the style described by the user, improving immersion. Comparison results show that the proposed scene agent synthesizes virtual scenes with better affordance maintenance, scene diversity, style maintenance, and 3D intersection over union (3D IoU) than state-of-the-art baseline methods. To the best of our knowledge, this is the first work to achieve in situ scene synthesis with virtual-real affordance consistency while satisfying user demands.
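
For concreteness, below is a minimal sketch of the 3D IoU metric cited in the evaluation, computed for axis-aligned 3D bounding boxes. The paper does not spell out its exact formulation, so the box representation (min/max corner tuples) and the axis-aligned assumption are ours.

```python
# 3D intersection over union (3D IoU) for axis-aligned boxes, a sketch of the
# metric named in the abstract. Each box is ((xmin, ymin, zmin), (xmax, ymax, zmax)).

def iou_3d(box_a, box_b):
    (ax0, ay0, az0), (ax1, ay1, az1) = box_a
    (bx0, by0, bz0), (bx1, by1, bz1) = box_b

    # Overlap extent along each axis; clamp to zero when the boxes are disjoint.
    dx = max(0.0, min(ax1, bx1) - max(ax0, bx0))
    dy = max(0.0, min(ay1, by1) - max(ay0, by0))
    dz = max(0.0, min(az1, bz1) - max(az0, bz0))
    inter = dx * dy * dz

    vol_a = (ax1 - ax0) * (ay1 - ay0) * (az1 - az0)
    vol_b = (bx1 - bx0) * (by1 - by0) * (bz1 - bz0)
    union = vol_a + vol_b - inter
    return inter / union if union > 0 else 0.0


# Example: a virtual sofa whose box overlaps half of a physical couch's box.
physical = ((0.0, 0.0, 0.0), (2.0, 1.0, 1.0))
virtual = ((1.0, 0.0, 0.0), (3.0, 1.0, 1.0))
print(iou_3d(physical, virtual))  # 1/3, i.e. about 0.333
```

A higher 3D IoU between a synthesized virtual object and its physical counterpart indicates that the virtual scene stays registered with the real layout the user can touch.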
Primary Subject Area: [Generation] Multimedia Foundation Models
Secondary Subject Area: [Content] Multimodal Fusion
Relevance To Conference: This paper proposes a scene agent that synthesizes situated 3D virtual scenes as a ubiquitous embodied interface in VR. The scene agent synthesizes scenes by perceiving the user's physical environment and inferring the user's demands. The synthesized scenes maintain the affordances of the physical environment, enabling immersed users to interact with physical objects and improving their sense of security. In particular, the user's demands are expressed in natural language, and information about the physical environment is obtained from a 3D scene reconstructed from RGB-D images. A large language model (LLM) serves as the bridge between these modalities to achieve scene synthesis. These aspects relate directly to the conference themes of multimedia foundation models and multimodal fusion.
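
The described pipeline (RGB-D reconstruction for perception, natural-language demands, an LLM as the bridge) can be sketched as below. This is a hypothetical illustration, not the authors' interface: `PhysicalObject`, `propose_virtual_scene`, `call_llm`, and the prompt format are all our assumptions.

```python
# Hypothetical sketch of the scene-agent pipeline: the physical environment is
# summarized as labeled boxes with affordances, the user's demand arrives as
# text, and an LLM proposes style-matched virtual objects that keep each
# affordance and footprint. All names and the prompt wording are illustrative.
import json
from dataclasses import dataclass

@dataclass
class PhysicalObject:
    label: str       # e.g. "sofa", from the reconstructed RGB-D scene
    affordance: str  # e.g. "sittable", "leanable", "placeable"
    bbox: tuple      # ((xmin, ymin, zmin), (xmax, ymax, zmax)) in meters

def propose_virtual_scene(env, user_demand, call_llm):
    """Ask an LLM for virtual replacements that preserve physical affordances.

    `call_llm` is any callable mapping a prompt string to a completion string.
    """
    prompt = (
        "You replace physical objects with virtual ones of a requested style, "
        "keeping each object's affordance and 3D bounding box.\n"
        f"Style request: {user_demand}\n"
        f"Physical objects: {json.dumps([vars(o) for o in env])}\n"
        "Answer as a JSON list of {label, affordance, bbox}."
    )
    return json.loads(call_llm(prompt))
```

A real agent would additionally validate the returned placements against the physical layout, for example with the 3D IoU computation sketched after the abstract.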
Supplementary Material: zip
Submission Number: 5061