Temporal-based graph reasoning for Visual Commonsense Reasoning

Published: 2025, Last Modified: 23 Jan 2026 · Knowl. Based Syst. 2025 · CC BY-SA 4.0
Abstract: Visual Commonsense Reasoning (VCR) aims to answer a question about a given image while providing a rationale that explains why the answer is correct. Most studies have achieved remarkable performance through semantic alignment between the still image and the answers. However, it is not trivial to answer temporal questions that ask about some future moment, beyond the static content of the image. In this paper, we propose an Action-aware Temporal Graph Attention Network (ATGAN), in which temporal-oriented action reasoning is performed to infer the future action aligned with the answer. Because this reasoning is built over multiple actions, a verb-centric action segmentation module is designed to learn the importance distribution of key arguments associated with the verb and of the words surrounding the verb, via discrete argument attention and continuous span attention, respectively. Additionally, we propose a question-guided visual extraction module that highlights visual objects relevant to the question via question commands and captures their relations in the image. Experimental results show that ATGAN outperforms strong baselines, especially on temporal questions, improving performance by 2.88% and 2.50% on the two subtasks, question answering and answer justification, respectively.
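The discrete argument attention mentioned in the abstract can be sketched, in spirit, as dot-product attention that scores each verb argument against the verb embedding and pools them by importance. This is a minimal illustrative sketch, not the paper's actual implementation; all names, shapes, and the scoring function are assumptions.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over a 1-D score vector.
    e = np.exp(x - x.max())
    return e / e.sum()

def argument_attention(verb_vec, arg_vecs):
    """Toy discrete argument attention: score each argument embedding
    (e.g. subject, object, place) against the verb embedding, then
    return the importance distribution and the weighted summary."""
    scores = arg_vecs @ verb_vec          # one relevance score per argument
    weights = softmax(scores)             # importance distribution over arguments
    pooled = weights @ arg_vecs           # attention-weighted argument summary
    return weights, pooled

rng = np.random.default_rng(0)
verb = rng.normal(size=8)                 # hypothetical verb embedding
args = rng.normal(size=(3, 8))            # three hypothetical argument embeddings
weights, pooled = argument_attention(verb, args)
```

The continuous span attention over words surrounding the verb could be realized analogously, with scores computed over a contiguous token window instead of a discrete argument set.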