Abstract: Visual Question Answering (VQA) poses a great challenge to the computer vision and natural language processing communities. Most existing approaches consider video-question pairs individually during training. However, we observe that a VQA task usually involves multiple questions (whether sequentially generated or not) about the target video, and these questions carry abundant semantic relations. To explore these relations, we propose a new paradigm for VQA termed Multi-Question Learning (MQL). Inspired by multi-task learning, MQL jointly learns from multiple questions and their corresponding answers for a target video sequence. The learned representations of video-question pairs are thus more general and transfer better to new questions. We further propose an effective VQA framework and design a training procedure for MQL, in which a specifically designed attention network models the relations between the input video and its corresponding questions, enabling multiple video-question pairs to be co-trained. Experimental results on public datasets show the favorable performance of the proposed MQL-VQA framework compared to state-of-the-art methods.
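To make the MQL idea concrete, below is a minimal, hypothetical sketch (not the authors' actual architecture) of how multiple questions about the same video can be co-trained in a single step: all names, dimensions, and the attention layout are illustrative assumptions, with each question attending over shared video features and a single loss summed over all question-answer pairs.

```python
import torch
import torch.nn as nn

class MQLVQASketch(nn.Module):
    """Illustrative sketch of Multi-Question Learning: one video, several
    questions encoded jointly, trained with a shared loss over all pairs.
    Module names and dimensions are hypothetical, not from the paper."""
    def __init__(self, video_dim=2048, q_vocab=10000, hidden=512, num_answers=1000):
        super().__init__()
        self.video_proj = nn.Linear(video_dim, hidden)
        self.q_embed = nn.Embedding(q_vocab, hidden)
        self.q_encoder = nn.GRU(hidden, hidden, batch_first=True)
        # attention relating the video frames to each encoded question
        self.attn = nn.MultiheadAttention(hidden, num_heads=8, batch_first=True)
        self.classifier = nn.Linear(hidden, num_answers)

    def forward(self, video_feats, questions):
        # video_feats: (T, video_dim) precomputed frame features for one video
        # questions:   (M, L) token ids for M questions about the same video
        v = self.video_proj(video_feats).unsqueeze(0).expand(questions.size(0), -1, -1)
        _, q = self.q_encoder(self.q_embed(questions))   # final state: (1, M, hidden)
        q = q.transpose(0, 1)                            # (M, 1, hidden)
        ctx, _ = self.attn(query=q, key=v, value=v)      # each question attends over frames
        return self.classifier(ctx.squeeze(1))           # (M, num_answers) answer logits

# Co-train all questions of one video in a single optimization step.
model = MQLVQASketch()
video = torch.randn(30, 2048)             # 30 frames of precomputed features
qs = torch.randint(0, 10000, (4, 12))     # 4 questions, 12 tokens each
answers = torch.randint(0, 1000, (4,))    # one answer label per question
loss = nn.functional.cross_entropy(model(video, qs), answers)
loss.backward()
```

The key point the sketch tries to capture is that the video representation is shared across all questions in the loss, so gradients from every question-answer pair shape the same video-question encoding, in the spirit of multi-task learning.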