3D Question Answering with Scene Graph Reasoning

Published: 01 Jan 2024 · Last Modified: 11 Nov 2024 · ACM Multimedia 2024 · CC BY-SA 4.0
Abstract: 3D question answering (3DQA) has gained considerable attention due to its enhanced spatial understanding capabilities compared to image-based VQA. However, existing 3DQA methods have focused on fusing text features with color-coded point-cloud features, thereby overlooking the rich high-level semantic relationships among objects. In this paper, we propose a novel graph-based 3DQA method, termed 3DGraphQA, which leverages scene graph reasoning to handle complex reasoning tasks in 3DQA and offers stronger interpretability. Specifically, our method first adaptively constructs dynamic scene graphs for the 3DQA task. We then inject the situation and question inputs into the scene graph, forming the situation-graph and the question-graph, respectively. Based on the constructed graphs, we perform intra- and inter-graph feature propagation for efficient graph inference: intra-graph feature propagation is carried out within each graph by a Graph Transformer to realize single-modal contextual interaction and high-order contextual interaction; inter-graph feature propagation is carried out across graphs by bilinear graph networks to realize interaction between the contexts of situations and questions. Through this intra- and inter-graph feature propagation, our approach better captures the intricate semantic and spatial relationships among objects in the scene and their relations to the questions, thereby facilitating reasoning over complex and compositional questions. We validate the effectiveness of our approach on the SQA3D and ScanQA datasets, and extend the SQA3D dataset to SQA3D Pro with multi-view information, making it more suitable for our approach. Experimental results demonstrate that our 3DGraphQA outperforms existing methods.
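The two propagation stages described in the abstract can be illustrated with a minimal sketch. Note this is an assumption-laden toy, not the authors' implementation: `intra_graph_propagate` approximates Graph-Transformer-style propagation as masked self-attention restricted to scene-graph edges, and `inter_graph_propagate` approximates a bilinear graph network as a bilinear attention from situation-graph nodes to question-graph nodes. All function names, shapes, and the single-head, single-layer structure are illustrative choices.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def intra_graph_propagate(X, A, Wq, Wk, Wv):
    """One masked self-attention step over graph nodes (Graph-Transformer style).

    X: (n, d) node features; A: (n, n) adjacency (nonzero = edge, incl. self-loops);
    Wq/Wk/Wv: (d, d) projection matrices.
    """
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    # Restrict attention to scene-graph edges: non-edges get -inf-like scores.
    scores = np.where(A > 0, scores, -1e9)
    return softmax(scores, axis=-1) @ V

def inter_graph_propagate(Xs, Xq, W):
    """Bilinear cross-graph interaction: each situation node gathers question context.

    Xs: (ns, d) situation-graph nodes; Xq: (nq, d) question-graph nodes;
    W: (d, d) bilinear weight. Returns (ns, d) question-conditioned features.
    """
    attn = softmax(Xs @ W @ Xq.T, axis=-1)  # (ns, nq) cross-graph attention
    return attn @ Xq
```

A full model would stack several such layers with residual connections and feed the fused node features to an answer classifier; this sketch only shows the shape of the two interactions.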