Abstract: The proliferation of educational videos on the Internet has changed the educational landscape by enabling students to learn complex concepts at their own pace. Our work outlines the vision of an automated tutor: a multimodal QA system that answers questions from students watching a video. Such a system could resolve students' doubts faster and further improve the learning experience. In this work, we take the first steps towards building such a QA system. We curate and release a dataset named EduVidQA, with 3,158 videos and 18,474 QA pairs. However, building and evaluating such a QA system proves challenging, because (1) existing evaluation metrics do not correlate with human judgments, and (2) a student question can be answered in many different ways, so training on a single gold answer often confuses the model and degrades its performance. We conclude with important research questions for developing this research area further.
Paper Type: Short
Research Area: Multimodality and Language Grounding to Vision, Robotics and Beyond
Research Area Keywords: vision question answering, video processing, multimodality
Contribution Types: NLP engineering experiment, Data resources
Languages Studied: English
Submission Number: 5416