Abstract: To understand movies, humans constantly reason over the dialogues and actions shown in specific scenes and relate them to the overall storyline already seen. Inspired by this behaviour, we design ROLL,
a model for knowledge-based video story question answering that leverages three crucial aspects of movie understanding: dialog comprehension,
scene reasoning, and storyline recalling. In ROLL, each of these tasks is in
charge of extracting rich and diverse information by 1) processing scene
dialogues, 2) generating video scene descriptions without supervision, and 3)
obtaining external knowledge in a weakly supervised fashion. To answer
a given question correctly, the information generated by each cognitive-inspired task is encoded via Transformers and fused through a modality
weighting mechanism, which balances the information from the different sources. Exhaustive evaluation demonstrates the effectiveness of our
approach, which sets a new state of the art on two challenging video
question answering datasets: KnowIT VQA and TVQA+.
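
As an illustration of the fusion step, the following is a minimal sketch of one way a modality weighting mechanism could combine per-branch answer scores. It is not the implementation used in ROLL: the function name `fuse_scores`, the branch names, the softmax normalization of the weights, and the weighted-sum fusion are all assumptions made for this example. Each branch (dialog, scene, storyline) is assumed to have already produced one score per candidate answer, for instance from its own Transformer encoder.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_scores(branch_scores, branch_logits):
    """Fuse per-branch candidate-answer scores with normalized weights.

    branch_scores: dict mapping a branch name to a list of scores,
        one per candidate answer (hypothetical branch outputs).
    branch_logits: dict mapping the same branch names to an
        unnormalized weight (assumed learnable in a real model).
    Returns (fused_scores, weights), where weights sum to 1.
    """
    names = list(branch_scores)
    if set(names) != set(branch_logits):
        raise ValueError("scores and logits must cover the same branches")
    lengths = {len(branch_scores[name]) for name in names}
    if len(lengths) != 1:
        raise ValueError("all branches must score the same candidates")

    # Normalize the per-branch weights so they sum to 1.
    normalized = softmax([branch_logits[name] for name in names])
    weights = dict(zip(names, normalized))

    # Weighted sum of the branch scores for each candidate answer.
    num_candidates = lengths.pop()
    fused = [
        sum(weights[name] * branch_scores[name][i] for name in names)
        for i in range(num_candidates)
    ]
    return fused, weights
```

With equal logits for the three branches, the fused score of each candidate is simply the mean of its branch scores, and the predicted answer is the candidate with the highest fused score. Raising one branch's logit shifts the prediction toward that branch's ranking.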