Abstract: Audio-Visual Question Answering (AVQA) requires a model to answer questions about complex, dynamic audio-visual content. Prior work on this task mainly trains on a single question-answer pair at a time, overlooking the rich semantic associations between questions. In this work, we propose a novel Collective Question-Guided Network (CoQo), which accepts multiple question-answer pairs as input and leverages reasoning over these questions to assist model training. The core component is the proposed Question Guided Transformer (QGT), which uses collective question reasoning to perform question-guided feature extraction. Since multiple question-answer pairs are not always available, especially during inference, QGT uses a set of learnable tokens to absorb the collective information from multiple questions during training. At inference time, these learnable tokens provide additional reasoning information even when only one question is given as input. We employ QGT in both the spatial and temporal dimensions to extract question-related features effectively and efficiently. To better capture detailed audio-visual associations, we train the model at a finer level by distinguishing feature pairs of different questions within the same video. Extensive experiments demonstrate that our method achieves state-of-the-art performance on three AVQA datasets while significantly reducing training time. We also observe strong performance of our method on three VQA benchmarks. Detailed ablation studies further confirm the effectiveness of the proposed collective question reasoning scheme, both quantitatively and qualitatively.
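The abstract describes QGT's use of learnable tokens that aggregate collective information from multiple questions during training yet still operate when only a single question is available at inference. The paper does not specify the exact mechanism, but one common way to realize such token-based aggregation is cross-attention from learnable tokens to question features. The sketch below is a minimal, hypothetical illustration of that idea in numpy; the function and variable names (`collective_tokens`, `question_feats`) are our own and not from the paper:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def collective_tokens(question_feats, tokens):
    """Cross-attention from learnable tokens to question features.

    tokens:         (k, d) learnable parameters, trained jointly with the model
    question_feats: (n_q, d) features of n_q questions (n_q > 1 in training,
                    typically n_q == 1 at inference)
    returns:        (k, d) token representations enriched with question info
    """
    d = tokens.shape[1]
    attn = softmax(tokens @ question_feats.T / np.sqrt(d))  # (k, n_q)
    return attn @ question_feats                            # (k, d)

rng = np.random.default_rng(0)
d, k = 64, 4
tokens = rng.normal(size=(k, d))

# Training-style input: several questions about the same video.
multi_q = rng.normal(size=(8, d))
out_train = collective_tokens(multi_q, tokens)

# Inference-style input: a single question; the same tokens still apply.
single_q = rng.normal(size=(1, d))
out_infer = collective_tokens(single_q, tokens)
```

Because the output shape depends only on the number of tokens, not on the number of questions, the same module works with one question or many, which mirrors the train/inference asymmetry the abstract describes.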
External IDs: dblp:journals/ijcv/PeiHCXWWLQW25