Two-Stream Heterogeneous Graph Network with Dynamic Interactive Learning for Video Question Answering
Abstract: Video question answering (VideoQA) challenges the joint learning of visual and linguistic knowledge. Previous methods neither fully explore the dynamic video-question interaction nor value the search for answer clues in textual semantics during that interaction. To address these issues, this paper proposes a novel Two-Stream Heterogeneous Graph Network (TSHGNet) with Dynamic Interactive Learning (DIL) to achieve effective reasoning between videos and questions. Inspired by the way people answer questions, the two-stream architecture of TSHGNet is designed to extract question-driven visual cues and video-driven textual semantics, respectively. Within each stream, DIL gradually refines the comprehension of question semantics and the extraction of visual representations through heterogeneous graph interactions. Extensive experiments and qualitative analyses on three benchmark datasets demonstrate that the proposed TSHGNet outperforms previous state-of-the-art methods and confirm the effectiveness of each component of our method.
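To make the two-stream idea concrete, the following is a minimal sketch of iterative cross-stream refinement, where one stream extracts question-driven visual cues and the other video-driven textual semantics. All names, dimensions, the scaled dot-product attention, and the simple averaging update are illustrative assumptions, not the paper's actual DIL or heterogeneous-graph formulation.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax over the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attend(queries, keys):
    # Each query aggregates the key set via scaled dot-product attention.
    scores = queries @ keys.T / np.sqrt(keys.shape[1])
    return softmax(scores, axis=-1) @ keys

def two_stream_refine(video, question, steps=2):
    """Hypothetical two-stream interactive refinement:
    video:    (num_frames, dim) frame features
    question: (num_tokens, dim) token features
    Each step passes messages across streams and refines both."""
    v, q = video, question
    for _ in range(steps):
        visual_cues = cross_attend(q, v)   # question-driven visual stream
        text_sem = cross_attend(v, q)      # video-driven textual stream
        # gradually refine both representations (illustrative update rule)
        q = 0.5 * (q + visual_cues)
        v = 0.5 * (v + text_sem)
    return v, q

rng = np.random.default_rng(0)
v_out, q_out = two_stream_refine(rng.standard_normal((8, 16)),
                                 rng.standard_normal((5, 16)))
```

In this toy version, repeating the loop plays the role of DIL's gradual refinement: each iteration lets the question representation absorb visual evidence while the video representation absorbs textual semantics.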