FaST-3D: Integrating Fast and Slow Thinking for 3D Visual Question Answering

Published: 15 Nov 2025, Last Modified: 08 Mar 2026, Venue: AAAI 2026 Bridge LMReasoning, License: CC BY 4.0
Keywords: Dual process theory, cognitive science, 3D VQA (Visual Question Answering)
Abstract: 3D Visual Question Answering (VQA) is one of the most challenging tasks in the 3D Vision-Language (3D-VL) domain, as it requires both precise interpretation of 3D environments and effective reasoning over natural language questions. Existing approaches have made preliminary progress but still suffer from limited robustness and an oversimplified treatment of VQA's inherently open-ended nature. To address these issues, we propose FaST-3D, an agent based on dual process theory that integrates two complementary reasoning systems: a fast-thinking system for rapid visual reasoning using representational memory, and a slow-thinking system for detailed logical reasoning based on abstract memory. The fast system uses a Multimodal Large Language Model (MLLM) to process visual input quickly, enhanced by an adaptive image retriever and a confidence reflection module. When confidence is low, the slow system is activated and invokes a toolset for step-by-step reasoning. Extensive experiments on the ScanQA and OpenEQA datasets show that FaST-3D achieves state-of-the-art (SOTA) performance across all metrics in zero-shot settings, particularly on the open-ended LLM-Match score. It improves model robustness across different scenarios and offers substantial practical value for embodied intelligence.
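The fast/slow hand-off described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the names `answer_question`, `fast_system`, `slow_system`, and the `threshold` value of 0.5 are all hypothetical, and the actual confidence reflection module, image retriever, and toolset are not specified in the abstract. The sketch only shows the routing idea, in which a fast answer is accepted when its confidence is high enough and a slower step-by-step path is used otherwise.

```python
from dataclasses import dataclass
from typing import Callable, Tuple


@dataclass
class Answer:
    text: str          # final answer string
    confidence: float  # confidence reported by the fast system
    system: str        # "fast" or "slow", i.e. which path produced the answer


def answer_question(
    question: str,
    fast_system: Callable[[str], Tuple[str, float]],  # hypothetical: returns (answer, confidence)
    slow_system: Callable[[str], str],                # hypothetical: tool-based step-by-step reasoning
    threshold: float = 0.5,                           # assumed cutoff, not taken from the paper
) -> Answer:
    """Route a question through the fast path, falling back to the slow path."""
    text, confidence = fast_system(question)
    if confidence >= threshold:
        # The fast answer is trusted as-is.
        return Answer(text, confidence, "fast")
    # Low confidence: activate the slow system instead.
    return Answer(slow_system(question), confidence, "slow")
```

With stub callables standing in for the MLLM and the toolset, a call such as `answer_question("What is next to the sofa?", fast, slow)` returns an `Answer` whose `system` field records which path was taken.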
Submission Number: 23