Abstract: Visual question answering (VQA) aims to answer questions about the visual content of an image or a video. Currently, most work on VQA focuses on image-based question answering, and less attention has been paid to answering questions about videos. However, video VQA presents unique challenges that are worth studying: it requires not only modeling a sequence of visual features over time, but often also reasoning about the associated subtitles. In this work, we propose to use BERT, a sequence-modelling technique based on Transformers, to encode the complex semantics of video clips. Our proposed model jointly captures the visual and language information of a video scene by encoding not only the subtitles but also a sequence of visual concepts with a pretrained language-based Transformer. In our experiments, we exhaustively study the performance of our model under different input arrangements, showing substantial improvements over previous work on two well-known video VQA datasets: TVQA and Pororo.
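
As a rough illustration of the joint encoding the abstract describes (a minimal sketch, not the authors' released code), the snippet below uses the Hugging Face transformers library to encode a subtitle stream together with a stream of visual-concept labels using a pretrained BERT. The example question, subtitle line, concept labels, pooling choice, and maximum length are all illustrative assumptions.

```python
import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")
model.eval()

# Hypothetical inputs: a question plus one clip's subtitle line, and
# visual concepts rendered as word labels (assumption: the paper's
# "sequence of visual concepts" can be verbalized this way, e.g. as
# object labels from an off-the-shelf detector).
question = "What is Pororo holding when he walks in?"
subtitle = "Pororo: Look what I found outside!"
visual_concepts = "penguin fish snow door"

# Pack the language stream (question + subtitle) as segment A and the
# visual-concept stream as segment B of BERT's two-segment input.
inputs = tokenizer(
    question + " " + subtitle,
    visual_concepts,
    return_tensors="pt",
    truncation=True,
    max_length=128,
)

with torch.no_grad():
    outputs = model(**inputs)

# Use the [CLS] embedding as a joint clip representation; a small
# classifier head over this vector would score each candidate answer.
clip_repr = outputs.last_hidden_state[:, 0]
print(clip_repr.shape)  # torch.Size([1, 768])
```

Feeding the two streams as BERT's segment-A/segment-B pair lets the pretrained self-attention layers relate subtitle tokens to visual-concept tokens directly, which is one plausible way to realize the "different input arrangements" studied in the experiments.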