Abstract: Neural module network (NMN) based methods have shown promising performance in visual question answering (VQA). However, existing methods overlook the fact that a given question may admit multiple reasoning paths: they generate a single reasoning path per question, which restricts the diversity of module combinations. Additionally, these methods generate reasoning paths solely from the question, neglecting visual cues, which may lead to sub-optimal paths in multi-step reasoning scenarios. In this paper, we introduce the Visual-Guided Neural Module Network (V-NMN), a neuro-symbolic method that integrates visual information to enhance the model’s reasoning capabilities. Specifically, we leverage the reasoning capability of large language models (LLMs) to generate all feasible reasoning paths for a question in a few-shot manner. We then assess how well each path suits the image and select the optimal one based on this assessment. The final answer is derived by executing the reasoning process along the selected path. We evaluate our method on the GQA dataset and on CX-GQA, a test set that requires multi-step reasoning. Experimental results demonstrate its effectiveness in real-world scenarios.
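To make the three-stage pipeline described above concrete, the following is a minimal illustrative sketch, not the paper's implementation; the helper names (`llm.generate_paths`, `scorer`, `module_executor.run`) and their interfaces are assumptions introduced purely for illustration.

```python
def answer_question(image, question, llm, scorer, module_executor, exemplars):
    """Sketch of a visually guided NMN pipeline (hypothetical interfaces)."""
    # 1. Few-shot prompt an LLM to enumerate candidate reasoning paths
    #    (sequences of neural modules) for the question.
    candidate_paths = llm.generate_paths(question, exemplars=exemplars)

    # 2. Assess how well each candidate path suits the image and keep the
    #    highest-scoring one (the visual-guidance step).
    best_path = max(candidate_paths, key=lambda path: scorer(image, path))

    # 3. Execute the neural modules along the selected path to obtain the answer.
    return module_executor.run(best_path, image)
```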