Abstract: 3D Question Answering (3DQA) has gained considerable attention because it offers stronger spatial understanding than image-based VQA. However, existing 3DQA methods focus primarily on fusing text features with colored point cloud features, overlooking the rich high-level semantic relationships among objects in the scene.
In this paper, we propose a novel graph-based 3DQA method, termed 3DGraphQA, which leverages scene graph reasoning to handle complex reasoning tasks in 3DQA and offers stronger interpretability.
Specifically, our method first adaptively constructs dynamic scene graphs for the 3DQA task. We then inject the situation and question inputs into the scene graph, forming the situation-graph and the question-graph, respectively.
Based on the constructed graphs, we finally perform intra- and inter-graph feature propagation for efficient graph inference: intra-graph feature propagation applies a Graph Transformer within each graph to realize single-modal contextual interaction and high-order contextual interaction, while inter-graph feature propagation applies bilinear graph networks across graphs to realize interaction between the different contexts of the situation and the question.
Drawing on this intra- and inter-graph feature propagation, our approach better captures the intricate semantic and spatial relationships among objects within the scene, as well as their relations to the question, thereby facilitating reasoning over complex and compositional questions.
We validate the effectiveness of our approach on the ScanQA and SQA3D datasets, and extend SQA3D to SQA3D Pro with multi-view information, making it more suitable for our approach. Experimental results demonstrate that our 3DGraphQA outperforms existing methods.
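For illustration, the following is a minimal PyTorch-style sketch of the intra-graph feature propagation described above: a single Graph Transformer layer in which node features (scene objects plus the injected question or situation tokens) attend only along scene-graph edges. All class names, argument names, and dimensions here are assumptions for exposition, not the authors' implementation.

```python
# Minimal illustrative sketch (not the authors' code): one Graph Transformer layer
# for intra-graph feature propagation. Node features attend only to graph
# neighbours via an attention mask derived from the scene-graph adjacency matrix.
import torch
import torch.nn as nn


class GraphTransformerLayer(nn.Module):
    def __init__(self, dim: int = 256, num_heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.ReLU(), nn.Linear(4 * dim, dim))
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, x: torch.Tensor, adj: torch.Tensor) -> torch.Tensor:
        # x:   (batch, num_nodes, dim)          node features (objects + injected text tokens)
        # adj: (batch, num_nodes, num_nodes)    1 where an edge exists, 0 otherwise
        eye = torch.eye(adj.size(1), device=adj.device)   # keep self-loops so no row is fully masked
        mask = (adj + eye) < 0.5                          # True = attention blocked (no edge)
        mask = mask.repeat_interleave(self.attn.num_heads, dim=0)
        h, _ = self.attn(x, x, x, attn_mask=mask)         # edge-restricted multi-head attention
        x = self.norm1(x + h)                             # residual + layer norm
        return self.norm2(x + self.ffn(x))                # position-wise feed-forward update
```

Stacking several such layers lets information travel along multi-hop paths of the graph, which is one way to obtain the high-order contextual interaction mentioned in the abstract.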
Primary Subject Area: [Content] Vision and Language
Secondary Subject Area: [Content] Vision and Language
Relevance To Conference: We propose a novel graph-based 3DQA method, which exploits dynamic scene graphs to facilitate 3DQA tasks.
We introduce a Graph Transformer-based model for intra-graph feature fusion, enabling contextual interactions between the scene objects and the question, and between the scene objects and the situation description.
We leverage the bilinear graph neural network for inter-graph feature fusion, which enhances contextual interactions between different graphs (a minimal sketch is given after this list).
We develop the SQA3D Pro dataset, an extension of SQA3D with additional multi-view situation information, drawing inspiration from the ScanQA dataset.
We conduct extensive experiments on two public benchmark datasets, i.e., SQA3D and ScanQA. Experimental results show that our model outperforms all baseline methods.
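As a companion to the Graph Transformer sketch above, the following hypothetical snippet illustrates inter-graph feature fusion in the spirit of a bilinear graph network: every question-graph node aggregates situation-graph context weighted by a low-rank bilinear affinity. Names and dimensions are assumptions, not the released implementation.

```python
# Illustrative sketch (assumed names, not the released implementation):
# inter-graph feature propagation via low-rank bilinear attention between
# question-graph nodes and situation-graph nodes.
import torch
import torch.nn as nn


class BilinearInterGraphFusion(nn.Module):
    def __init__(self, dim: int = 256, rank: int = 128):
        super().__init__()
        self.proj_q = nn.Linear(dim, rank)   # project question-graph node features
        self.proj_s = nn.Linear(dim, rank)   # project situation-graph node features
        self.out = nn.Linear(dim, dim)

    def forward(self, q_nodes: torch.Tensor, s_nodes: torch.Tensor) -> torch.Tensor:
        # q_nodes: (batch, Nq, dim), s_nodes: (batch, Ns, dim)
        # Low-rank bilinear affinity between every question node and every situation node.
        affinity = torch.matmul(self.proj_q(q_nodes), self.proj_s(s_nodes).transpose(1, 2))
        attn = affinity.softmax(dim=-1)      # (batch, Nq, Ns) cross-graph attention weights
        # Each question node gathers situation context weighted by its affinities.
        return q_nodes + self.out(torch.matmul(attn, s_nodes))
```

Applying the same fusion in the opposite direction (situation nodes attending to question nodes) would make the cross-graph interaction symmetric, mirroring the bidirectional interaction between situation and question contexts described above.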
Supplementary Material: zip
Submission Number: 4464