Collaborative Aware Bidirectional Semantic Reasoning for Video Question Answering

Published: 2025 · Last Modified: 08 Jan 2026 · IEEE Trans. Circuits Syst. Video Technol. 2025 · CC BY-SA 4.0
Abstract: Video question answering (VideoQA) is the challenging task of accurately answering natural language questions about a given video. Most previous methods focus on designing complex cross-modal interactions to perform question-oriented video scene mining and semantic reasoning, and they rely on straightforward classification or matching strategies with different decoders to forcibly associate the predicted representation with the ground-truth answer. However, the limitations of question-oriented reasoning and the overlapping semantic co-occurrences between questions and candidate answers may lead these methods into spurious correlation reasoning. In this paper, we propose a Collaborative aware Bidirectional Semantic Reasoning (CBSR) model to alleviate this problem. Specifically, we first propose a collaborative aware adaptive correlation reasoning module that collaboratively mines multi-granularity, text-aware critical video scenes and reasons about the complex intrinsic correlations between them via bottom-up, cross-granularity adaptive aggregation. By progressively performing video reasoning from the object level to the frame level, we obtain a set of semantically rich critical video representations. We then collaboratively decode these representations, together with question and knowledge semantics, into an implicit representation through the proposed unified answer semantic collaborated decoding module. Finally, we propose a novel bidirectional semantic reasoning learning strategy that bridges and strengthens the unique positive semantic correlation between the learned implicit representation and the ground-truth answer, explicitly alleviating the challenge of overlapping semantic co-occurrence. Because it shares the same model structure and learning strategy across settings, our method transfers seamlessly between Open-Ended and Multi-Choice tasks. Extensive experimental results on seven commonly tested datasets (i.e., MSVD-QA, MSRVTT-QA, NExT-QA, Causal-VidQA, NExT-OOD, ActivityNet-QA, and EgoSchema) verify the superior performance of our method and the effectiveness of each reasoning module. We provide our source code and experimental datasets at https://github.com/XizeWu/CBSR.