Abstract: 3D Visual Question Answering (3D-VQA) aims to understand positional relationships, object attributes, and layout in 3D scenes. A key challenge is aligning the representations of question-related entities with the relevant 3D objects while preserving fine-grained vision-language relations. To this end, we propose a Scene-guided Attention Network for Spatial Understanding, denoted SceSU, which perceives spatial and attribute information conditioned on the type of question posed about a 3D scene. More specifically, a Scene-Driven Spatial Understanding (SDSU) mechanism identifies the entities crucial to the question and constructs a fine-grained scene description via a large language model. Furthermore, a 3D Perception Attention (3D-PA) module fuses natural-language and 3D features, capturing the detailed relationships between them with a dual-branch attention network. Finally, SceSU applies the 3D-PA module to the fine-grained scene description generated by the SDSU mechanism, bridging the gap between the natural-language and 3D domains. Extensive experiments on the SQA3D and ScanQA datasets demonstrate the effectiveness of SceSU for 3D-VQA.
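The abstract does not specify the internals of the dual-branch attention in 3D-PA, but the described fusion of language and 3D features can be illustrated with a minimal sketch. The following is a hypothetical NumPy illustration, not the authors' implementation: it assumes standard scaled dot-product cross-attention in each branch (question tokens attending to 3D object features, and vice versa), with the function names, pooling choice, and dimensions invented for the example.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax over the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(q, k, v):
    """Scaled dot-product attention: each query attends over all keys/values."""
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d)          # (num_q, num_k) similarity scores
    return softmax(scores, axis=-1) @ v    # (num_q, d) attended values

def dual_branch_fusion(lang_feats, obj_feats):
    """Hypothetical dual-branch fusion: a language-to-3D branch and a
    3D-to-language branch, pooled and concatenated into one joint vector."""
    lang_to_obj = cross_attention(lang_feats, obj_feats, obj_feats)   # (T, d)
    obj_to_lang = cross_attention(obj_feats, lang_feats, lang_feats)  # (N, d)
    # Mean-pool each branch, then concatenate into a joint representation.
    return np.concatenate([lang_to_obj.mean(axis=0), obj_to_lang.mean(axis=0)])

rng = np.random.default_rng(0)
tokens = rng.normal(size=(8, 32))    # 8 question-token embeddings, dim 32
objects = rng.normal(size=(5, 32))   # 5 detected 3D-object embeddings, dim 32
joint = dual_branch_fusion(tokens, objects)
print(joint.shape)  # (64,)
```

In a real model the projections would be learned and the pooled vector would feed an answer classifier; here the sketch only shows how two attention branches let each modality query the other before fusion.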
External IDs: dblp:conf/mir/JiangZL0Y25